Notes from the Montreal Linux Power Management Mini-Summit

[Posted August 3, 2009 by corbet]

From:		Len Brown <lenb-AT-kernel.org>
To:		Linux Power Management List <linux-pm-AT-lists.osdl.org>, linux-acpi-AT-vger.kernel.org
Subject:		Montreal Linux Power Management Mini-Summit, July 13, 2009 - Meeting Notes
Date:		Thu, 30 Jul 2009 18:04:40 -0400 (EDT)
Cc:		Linux Kernel Mailing List <linux-kernel-AT-vger.kernel.org>

A Linux Power Management "mini-summit" was held on July 13th, 2009 -
on the first day of the Montreal Linux Symposium.

The Linux Symposium generously provided the facilities.

We repeated the process used in 2008: http://lwn.net/Articles/292447/

This year the meeting room was more accessible to the general attendees
of the Linux Symposium, so we had a fair number of "drop-ins".
25 signed in (listed below) plus a few more that came and went.
While this exceeded our cap of 20, the extra people did not hinder
our goal of focusing on a single discussion.

Attendees
---------
Len Brown - Intel - ACPI, SFI, Suspend co-Maintainer
Howard Alyne - Wind River
Pierre Phaneuf
Rafael J. Wysocki - SUSE Labs/Novell, U. Warsaw;  Hibernate and Suspend Maintainer
Per-Inge Tallberg - Ericsson
Rickard Andersson - Ericsson
Paul Mundt - Renesas - SH Maintainer
Magnus Damm - Renesas
Richard Woodruff - Texas Instruments, OMAP
Stephen Hui - Zarlink
John Linville - Red Hat - Wireless LAN maintainer
Mark Brown - Marvell
Samuel Thibault - labri.fr
Lucas Nussbaum - inria.fr
Srinivas Sripathi - Motorola
Jason Baron - Red Hat
Aristeu Rozanski - Red Hat - RHEL6 kernel maintainer
Christopher Curtis - RipTide Software
Klaus Pedersen - Nokia
H. Peter Anvin - Intel - x86 maintainer
Ernest Szedeman - Nortel
Rick Leir - Leirtech
David Ahern - Cisco
Wending Wen - Rheinmetall
Jason Chagas - Marvell

Some of the attendees are in photos here:
http://picasaweb.google.com/lenb417/2009LinuxSymposium#

Agenda
------
	1. Review changes over the last year
	2. Survey tools, techniques, workloads
	3. Discuss upcoming work

Summary of Power Management kernel changes since last year
----------------------------------------------------------
ACPI Platform BIOS compatibility fixes
	ACPI ACPI_SCI_EN work-around
	resume memory corruption workarounds

hibernation:
	NVS memory handling
	handle overlapping memory zones

suspend/resume framework re-work (Rafael Wysocki)
	shipped suspend/resume RTC test feature
	ordering update/workaround
	simplified driver interface now available
		r8169 etc. drivers now using it
	PCI PM framework re-worked to simplify drivers
	graphics drivers better support suspend/resume
		i915 video restore, though has bugs
		ATI making progress, especially older cards
		NVIDIA - continues to trail
			no open source support for devices after 7200
power aware scheduling
	sched_mc_power_savings
	per-CPU timers fixed
clock_events_broadcast()
	bugs fixed
	(no longer needed on Westmere, which has an always-running LAPIC timer)
range timers shipped upstream
	e.g. Android uses range timers to group wakeups around wireless activity
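
A minimal sketch of the idea behind range timers, not the kernel's hrtimer code: each timer carries an earliest and a latest acceptable expiry, and timers whose windows overlap can share one wakeup. The function name `coalesce` and the (earliest, latest) tuple layout are assumptions made for illustration.

```python
def coalesce(timers):
    """Group (earliest, latest) expiry windows into shared wakeups.

    Greedy approach: sort by latest deadline, wake at the latest
    deadline of the first unserved timer, and let that wakeup serve
    every timer whose window contains it.
    """
    wakeups = []
    for earliest, latest in sorted(timers, key=lambda t: t[1]):
        if wakeups and earliest <= wakeups[-1]:
            # The most recent wakeup already falls inside this window.
            continue
        wakeups.append(latest)
    return wakeups
```

With four timers whose windows are (100, 150), (120, 160), (140, 200) and (300, 310), the sketch needs two wakeups instead of four, which is the kind of saving grouping is meant to deliver.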

Intel shipped Nehalem (Core i7), which has always-running-TSC

Run Time power management is receiving some attention now.

OMAP (Richard Woodruff)
	2008 had TI releasing aggressive full-off reference code on public portals
		Customers snapshotted this code at different points
		Heavy support burden ramping variants into production
	Linux-OMAP community have been creating a cleaner version of aggressive PM
	code suitable for mainline kernel in Linux-OMAP PM branch.
		Hope of reduced burden for future kernels with mainlined code

ACPI sub-system (Len Brown)
	quality has been the focus for the last year.
	We continue to process about 300 bugs/year
	with 50-60 unresolved at any given time.

Wireless: (John Linville)
	mac80211 is now suspend/resume aware
	IEEE 802.11 has run-time power saving features
		e.g. negotiate w/ access point
		starting to deploy in drivers
	beacon filtering (reduces CPU wake-ups)
	TX power upcoming in cfg80211 API
		Nokia tablets pushing power savings

SH: (Paul Mundt)
	cpuidle integration
	using clocksources & clockevents from upstream
	can switch between timers depending on sleep states
	Hibernate & STR enabled, can test w/ RTC & kexec-jump-and-return

s390:
	added suspend/resume support

5-second boot on Atom netbook for Moblin
	async API is upstream
	Fedora 11 boots in 20 seconds on a notebook
		Down from 60 seconds in Fedora 10

PM-QOS shipped
	Documentation/power/pm_qos_interface.txt
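
A hedged sketch of how user space can use the PM-QOS interface described in that document: a process opens /dev/cpu_dma_latency, writes a signed 32-bit latency bound in microseconds, and the request stays registered for as long as the file descriptor is held open. The helper names below are illustrative assumptions, and opening the device normally requires sufficient privileges.

```python
import os
import struct


def encode_latency(usec):
    """Pack a latency bound as the signed 32-bit value the interface expects."""
    return struct.pack("=i", usec)


def request_cpu_dma_latency(usec, path="/dev/cpu_dma_latency"):
    """Register a latency requirement; it holds until the fd is closed."""
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, encode_latency(usec))
    return fd  # caller must os.close(fd) to drop the request
```

Closing the returned descriptor withdraws the requirement, so a long-running program would keep it open only while it actually needs the bound.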

Survey of Tools, Techniques, workloads for optimizing power management
----------------------------------------------------------------------
powertop
bootchart
bootgraph
CONFIG_POWER_TRACER=y
	LTT-lite
performance counters for energy coming
OMAP uses on-board instrumentation
suspend/resume debug I/F
Power meters:
O(100) Watts Up Pro; O(600) Extech; O(1000) Yokogawa
O(600) HP/Agilent 34401A
OMAP: measure per-power-plane w/ lab instruments
500mA vs uA range difficult to measure w/ precision
	multi-channel DAC - each channel calibrated to range
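
A small worked example of why one measurement range cannot cover both ends, and why per-range calibrated channels help. The 12-bit resolution and the list of ranges are assumptions picked for illustration, not figures from the meeting.

```python
def lsb_amps(full_scale_amps, bits):
    """Smallest current step an ideal converter with `bits` can resolve."""
    return full_scale_amps / (2 ** bits)


def pick_channel(expected_amps, full_scales):
    """Choose the smallest full-scale range that still fits the reading."""
    fitting = [fs for fs in full_scales if fs >= expected_amps]
    if not fitting:
        raise ValueError("reading exceeds every calibrated range")
    return min(fitting)
```

Under these assumptions a 500 mA range resolves steps of roughly 122 uA, so a sleep current of a few uA disappears below one step; routing it to a channel calibrated for a uA-scale range restores the precision.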

Workloads for measuring power:

handheld: no standard workloads
	however device vendors have internal benchmarks
	#1 idle
	#2 specific workloads
	#3 combination use-case

SpecPower benchmark for servers (only)

Energy Star for client computers
	idle only
	requires STR to be enabled by default
	Energy Star Server spec coming
	Future Energy Star wants to use energy benchmark
BAPCO
	MobileMark 2007 for Windows
	Apple joined, so expect something new to work also on Apple
	No Linux Distro representation
EEMBC
	released something or other...
BLTK (Battery Life Toolkit) for Linux
	http://www.lesswatts.org/projects/bltk/
	could use refresh
	could use new handheld workloads

Future plans for the PM development, kernel side
-------------------------------------------------
cpuidle C-states generalized to be platform idle states...
	platform driver can hide platform hooks into CPU power states

Runtime PM for Platform Devices.
	2.6.32 framework plan simmering
	SH running on top of prototype now
		context save/restore for power off power domain

	platform devices
		SH specific - Magnus
	IO devices
		eg PCI, USB - Alan Stern

	clock framework (started in ARM, now common on embedded)
		includes ref-counts/clock
		architecture specific implementation
		x86/ACPI system doesn't expose clock dependencies
			so unclear benefit to that arch
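
A toy model of the per-clock reference counting mentioned above, assuming a simple parent/child clock tree. The class and method names are hypothetical; the kernel's clock API is in C and each architecture supplies its own implementation.

```python
class Clock:
    """Toy reference-counted clock with an optional parent clock."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.users = 0

    def enable(self):
        # The first user turns on the parent chain before this clock.
        if self.users == 0 and self.parent is not None:
            self.parent.enable()
        self.users += 1

    def disable(self):
        if self.users == 0:
            raise RuntimeError(f"{self.name}: unbalanced disable")
        self.users -= 1
        # The last user releases the parent so it can be gated as well.
        if self.users == 0 and self.parent is not None:
            self.parent.disable()

    @property
    def running(self):
        return self.users > 0
```

The point of the count is that a shared parent keeps running until its last child is disabled, which is exactly the dependency information an x86/ACPI system does not expose.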

Run-time PM of I/O devices, from the PCI POV mostly
	ability to put device into D1/D2 (~200us) /D3 (10ms)
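
A rough sketch of how those exit latencies could drive a run-time decision: pick the deepest D-state whose wake-up latency still fits what the device's users can tolerate. The table reuses the approximate figures above; the function name and the policy itself are assumptions, and a real policy would also have to consider wakeup capability and which states the device supports.

```python
# Exit latencies in microseconds, from the approximate figures in the
# notes (D1/D2 about 200 us, D3 about 10 ms); D0 is fully on.
EXIT_LATENCY_US = {"D0": 0, "D1": 200, "D2": 200, "D3": 10_000}
DEPTH_ORDER = ["D3", "D2", "D1", "D0"]  # deepest first


def deepest_state(tolerable_latency_us):
    """Return the deepest state whose exit latency fits the tolerance."""
    for state in DEPTH_ORDER:
        if EXIT_LATENCY_US[state] <= tolerable_latency_us:
            return state
    return "D0"
```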

	wakeup: PCIe #PME plug-event via root port
	(PCI #PME is less well specified)

ACPI 4.0 adds D3hot
	Q: has an effect on _SD3?

Hibernate/suspend:

Axiom: we need more people fixing suspend/resume bugs

Suspend2 aka "Tux on Ice"
	Spring 2009 patch set to replace hibernate w/ TOI was
	deemed impractical by upstream community, which prefers
	an incremental approach.

	Since then, Nigel has sent specific patches to Rafael along
	the lines of gradual cherry-picking that upstream needs.

	First example is patch to compress hibernation image
	which Rafael thinks can be integrated.

	TOI is able to save larger hibernate images due to
	how it manages memory.  This is a nice benefit and
	we'd like to see if we can do it upstream.

	patch review bandwidth limited
	
	1. image compression
	2. image saving performance
		currently very slow
	3. ability to use multiple devices to save images
		including multiple swaps, and regular files
	4. break the half-of-memory image limitation
	5. Image encryption (solution for keys is an issue)

	It would be great to have Nigel supporting upstream hibernate.

	TOI supports snapshot boot via "kiosk mode"

Hibernate & kexec
	kexec-jump is upstream (i386, SH, no x86_64)
		simplifies memory management of the "jumped to" code
		unclear if any other advantages.

kexec-crash-dump is useful
	can make an oops "look less scary" and be automatic

STR performance
	eliminate console switch
	async device resume

android submitted "auto-suspend" patches
	compromise between low-level and high-level suspend invocation policy.

cpuidle vs auto-suspend
	suspend is more "draconian", it stops timers etc for you.
	platform drivers in cpuidle can get to same place.

Android
	OHA - Open Handset Alliance
		controls android license(s)
		Android = access to app-store
Moblin
	shall support Android applications

OMAP & SH specifics

	UIO - user space codec etc. have no concept of PM
		could use clock framework extension
		(clock framework is accessible via debugfs if necessary)
	interrupt coalescing
	deferred I/O to LCD
		delay until regular (infrequent) update interval
		use x-damage API to track change to visible screen
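
A minimal sketch of the deferred-I/O idea, assuming damage is tracked as a single bounding box and the panel is refreshed at most once per interval. The class is hypothetical and touches neither the X Damage extension nor a real framebuffer; it only shows the bookkeeping.

```python
class DeferredDisplay:
    """Accumulate damaged regions and flush them at a fixed interval."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.damage = None      # bounding box (x0, y0, x1, y1) or None
        self.last_flush = 0.0
        self.flushed = []       # regions pushed to the panel, for inspection

    def mark_damage(self, x0, y0, x1, y1):
        """Grow the pending bounding box to include a changed rectangle."""
        if self.damage is None:
            self.damage = (x0, y0, x1, y1)
        else:
            a, b, c, d = self.damage
            self.damage = (min(a, x0), min(b, y0), max(c, x1), max(d, y1))

    def tick(self, now):
        """Flush at most once per interval, and only if something changed."""
        if self.damage is None or now - self.last_flush < self.interval_s:
            return False
        self.flushed.append(self.damage)
        self.damage = None
        self.last_flush = now
        return True
```

When nothing on screen changes, `tick` never flushes, so the display link can stay idle between the infrequent updates.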

SH running cpufreq on top of clock framework
	cpufreq has notifiers, clock framework does not

lightweight CPU hotplug
	IBM proposed "idle throttling" approach using scheduler
	Intel is proposing simple "forced idle" RT thread
	PeterZ likes neither implementation, but
	favors the IBM approach in the long term.

	SH SMP wants to run Itron on some cores...
	low latency transition is important

Memory Power Management
	Nokia project w/ U. in Brazil
		more pain than gain in memory offline prototype
	"partial RAM self refresh"

	page tables for kernel memory would allow
		moving kernel physical memory

	memory off-line incompatible with high-performance interleaving

	using NUMA node to segment memory allows tracking
		unused memory
	anti-fragmentation went upstream last year

	consensus: online/offline
		node granularity only

ACPI 4.0 was published
	Error Reporting extensions
	processor aggregator device (forced idle to save power)
	D3hot
	generalized fan support
	thermal extensions
	IPMI op-region

	Len will do a Linux ACPI 4.0 presentation this Fall

virtualization power management
	PM is still an afterthought in the VMM space
		they have bigger problems

	KVM gets everything in Linux for free
		but could benefit from more info from the guests

	Xen gets to re-invent/port/re-implement everything in Linux

	VMMs have an easier time moving physical pages
		and thus doing memory power management

Posted Aug 3, 2009 16:06 UTC (Mon) by ssam (guest, #46587)

in a power management talk a while back (maybe at lugradio) the idea of automatically adjusting the backlight dimming timeout was raised. it sounded great but i don't think it has ever been implemented.

the idea was that you start with a timeout of say 30 sec. if the user wiggles the mouse within a few seconds of the screen blanking then you increase the timeout. otherwise reduce the timeout.

that way on days when i am staring at big chunks of code, and not scrolling, if the screen goes blank i just wiggle the mouse, and automatically get a longer timeout.

on days when i am reading a magazine, but occasionally wake the screen to check something, it can blank faster.
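
A rough sketch of the scheme described above, as user-space logic. The 30 second starting value comes from the description; the 5 second grace window, the doubling factor and the clamping limits are assumptions chosen for illustration.

```python
class AdaptiveTimeout:
    """Blanking timeout that grows or shrinks based on user reaction."""

    def __init__(self, timeout_s=30.0, grace_s=5.0, factor=2.0,
                 min_s=10.0, max_s=600.0):
        self.timeout_s = timeout_s
        self.grace_s = grace_s
        self.factor = factor
        self.min_s = min_s
        self.max_s = max_s

    def after_blank(self, seconds_until_input):
        """Update the timeout given how soon input followed a blanking."""
        if seconds_until_input <= self.grace_s:
            # The user was still watching: wait longer next time.
            self.timeout_s = min(self.timeout_s * self.factor, self.max_s)
        else:
            # Nobody reacted quickly: blank sooner next time.
            self.timeout_s = max(self.timeout_s / self.factor, self.min_s)
        return self.timeout_s
```

A session manager would call `after_blank` each time the screen wakes, passing the delay between blanking and the next mouse or keyboard event.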

"adaptive backlight dimming" ... brilliant!

Posted Aug 4, 2009 3:54 UTC (Tue) by jabby (guest, #2648)

This is brilliant and should be relatively easy to implement and stick behind a single configuration option called something like "smart backlight dimming".

Maybe it will be my first kernel hacking project if no one else jumps on it... Please, someone jump on it! You don't want my first foray into kernel code to be controlling your screen... ;o)

Posted Aug 4, 2009 6:06 UTC (Tue) by dkite (guest, #4577)

I've got arch linux - xfce on my eeepc, and it does that. Dims the
backlight after a certain time of inactivity.

Derek

Posted Aug 4, 2009 8:40 UTC (Tue) by muwlgr (guest, #35359)

Yes, but the idea is about adaptive inactivity period, determined by pauses in current user's action.

Posted Aug 4, 2009 14:07 UTC (Tue) by zdzichu (subscriber, #17118)

This is work for userspace, not kernel. Also see http://mjg59.livejournal.com/106581.html

Posted Aug 4, 2009 23:34 UTC (Tue) by pimlottc (guest, #44833)

To be honest, I've never understood the idea of backlight dimming in response to activity. If I'm using the computer, I want it to be at full (subject to preference/context) brightness. If I'm not using it, it might as well be off.

Perhaps there are more use cases for people who need to see what's on the screen without typing/mousing for long periods, but I would expect that to be niche.

Posted Aug 5, 2009 0:15 UTC (Wed) by foom (subscriber, #14868)

The only reason I like backlight dimming is that it is less abrupt and disruptive than simply turning off the screen. So if I am actually reading the screen but not interacting, the dim is a more subtle signal to twiddle the mouse than the screen suddenly going blank.

That said, I prefer the behavior Gnome has these days: when you leave it alone, it starts slowly fading to black over 10 seconds or so. That seems much better than having a drawn-out dimmed screen period.

Posted Aug 3, 2009 21:09 UTC (Mon) by shapr (subscriber, #9077)

s390: added suspend/resume support

Wait, what? They make s390 notebooks?

Posted Aug 3, 2009 21:32 UTC (Mon) by elanthis (guest, #6227)

Power saving isn't just about netbooks.

My desktop does suspend/hibernate, for example, so I can leave my machine powered off over night while still having my full session restored when I start it up in the morning.

For big power-hungry workstations, power saving is even more critical. The power consumed by workstations and servers is one of the biggest (if not the biggest) expenses a large ISP/IT department has to deal with.

Posted Aug 3, 2009 22:13 UTC (Mon) by drag (guest, #31333)

I doubt it's mostly for power savings.

Remember that typically with mainframe applications they charge money based on MIPS cycles, so the more processor you have, the more everything costs. In an efficiently running mainframe environment with proper setup and accounting you should be running at about 100% CPU 24/7 in order to get the best value.

They are not like PCs where you have the user or I/O as a bottleneck and the CPU spends most of its time idle... Mainframes tend to have massive amounts of I/O and relatively little CPU.

I would still like to have suspend-to-disk capabilities in a mainframe environment however. For various hardware issues and whatnot you do need to plan for downtime occasionally. Being able to suspend the Linux systems to disk reduces that downtime. Instead of needlessly wasting CPU time booting up and initializing the system you just load up the memory snapshot, which should almost always be much faster in a system like that.

Posted Aug 3, 2009 22:43 UTC (Mon) by ewan (guest, #5533)

There is also a fighting that someone just did it because it's cool.

Posted Aug 4, 2009 16:09 UTC (Tue) by ewan (guest, #5533)

Um. 'Fighting chance', that is. Drat.

OT: Biggest expense

Posted Aug 4, 2009 14:21 UTC (Tue) by man_ls (guest, #15091)

What? I would have sworn that making the payroll was a bigger expense than power costs. Especially in organizations with lots of development going around. Thing is, I never paid attention to energy costs so it may well be. Just curious: is it an impression of yours, or do you have numbers?

OT: Biggest expense

Posted Aug 4, 2009 20:41 UTC (Tue) by dlang (guest, #313)

we had a discussion on this on the lopsa mailing list and found that the cost breakdown still seems to be

people
servers
power

even allowing for 2x power consumption (to cover cooling, etc) servers on a 3 or so year replacement cycle would still cost more than the power they consume over that time (assuming max power draw the entire time)

power is a significant cost, and since it shows up as a single line item it jumps out at people, but it's still not as bad as people are making it out to be.

OT: Biggest expense

Posted Aug 4, 2009 22:09 UTC (Tue) by man_ls (guest, #15091)

It makes a lot of sense. Maybe at Google things are different, due to a couple of factors:

Their legendary ability to automate things: admins manage thousands of servers each. After a quick search it seems that the exact figure is 20k servers per admin.
Their huge purchasing power: if they buy 20M servers at a time I guess that they get a special price. Hey, if they used their own house brand it would be a big one, if not the biggest one in the industry; after all they do that with web servers.

For the rest of us things are different. At 0.10$/kWh, one server using 1kW (a high powered beast) at all times costs ~900$/yr. For a 100k$/yr admin (fully loaded) the breaking point is at ~110 high-powered servers per admin -- kind of the industry average according to the first link. You have to manage more per admin to spend more on power than on people, so in an IT department with any development, payroll should win hands down.

Similarly, if each server costs 3k$, the breaking point is with a lifecycle of just ~3 years. I would say that either machines cost more or use less juice, so servers should be above power too.
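
The arithmetic above can be checked with a few lines. This is only a sketch using the figures quoted in the comment (0.10$/kWh, 1kW per server, 100k$/yr per admin, 3k$ per server); the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_power_cost(kw, dollars_per_kwh):
    """Cost of running a constant load for one year."""
    return kw * HOURS_PER_YEAR * dollars_per_kwh


def servers_per_admin_breakeven(admin_cost, server_power_cost):
    """Servers per admin at which power spending equals payroll."""
    return admin_cost / server_power_cost


def hardware_breakeven_years(server_price, server_power_cost):
    """Years of power spending needed to equal the purchase price."""
    return server_price / server_power_cost
```

Without rounding, the yearly power bill comes to about 876$, which gives a break-even near 114 servers per admin and a hardware break-even of about 3.4 years, in line with the rounded ~110 and ~3 above.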

OT: Biggest expense

Posted Aug 4, 2009 22:30 UTC (Tue) by dlang (guest, #313)

I think that for most shops the server to admin ratio is well below 110:1

if you have any serious uses you have at least two people (probably 3) so that you have someone available all the time (with vacations, sick time, etc). a _lot_ of places which meet this criteria have fewer than the 220-330 servers that would be needed to maintain that ratio.

this ratio is also very dependent on how many different variations of server configurations that you have. google gets such phenomenal numbers of servers per admin by the fact that they have _lots_ of any one configuration. if they only had a couple thousand servers per configuration they would need far more admins than they do ;-) they also don't have their admins deal with failures, they just shut down the failed systems.

In many ways I would rather have another 50 servers to manage that fit in one of my existing baselines than to add 1 special exception box that is completely different.

OT: Biggest expense

Posted Aug 13, 2009 1:41 UTC (Thu) by deleteme (guest, #49633)

Well, Google admins aren't really in charge of 20k servers but of 5-50 computing clusters that are used by developers and G* applications as a server.

One baseline is good but not achievable.