RAPL (Running Average Power Limit) driver

From:  Jacob Pan
To:  LKML, Platform Driver, Matthew Garrett
Subject:  [PATCH 0/1] RAPL (Running Average Power Limit) driver
Date:  Tue, 2 Apr 2013 15:15:35 -0700
Cc:  Zhang Rui, Rafael Wysocki, Len Brown, Srinivas Pandruvada, Arjan van de Ven, Jacob Pan

The RAPL (Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SoC
power consumption. Since its first appearance on Sandy Bridge, more
features have been added to extend its usage. In RAPL, platforms are
divided into domains for fine-grained control. These domains include
package, DRAM controller, CPU core (power plane 0), graphics uncore
(power plane 1), etc.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform-level power capping and monitoring are mainly used for
thermal management, and the thermal layer provides the abstracted
interface needed to have portable applications.

Specifically, userspace is presented with a per-domain cooling device
with a sysfs link to its kobject. Although a RAPL domain provides many
parameters for fine tuning, the long-term power limit is exposed as the
single knob via the cooling device state, while the rest of the
parameters are still accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

1. sysfs layout

As an x86 platform driver, the RAPL driver binds to supported CPU IDs
during the probing phase. Once domains are discovered, a kobject is
created for each domain and is linked with its cooling device after
that device is registered with the generic thermal layer.

e.g. the package RAPL domain is registered as cooling device #15, whose
"device" link points back to its kobject.

├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In the driver's private sysfs area, domain kobjects are grouped under a
kset which exposes global data.

├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device15
│   ├── power_plane_0
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device16
│   └── power_plane_1
│       └── thermal_cooling -> ../../../../virtual/thermal/cooling_device18
└── subsystem -> ../../../bus/platform
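
The link between the two trees above can be checked from userspace. The
following is a minimal Python sketch, not part of the patch, that walks a
rapl_domains directory and resolves each domain's thermal_cooling link to
the name of its cooling device. The function name is hypothetical, and
the absolute path of the driver's sysfs directory is not given in this
mail, so the caller has to supply it.

```python
import os


def map_domains_to_cooling_devices(rapl_domains_dir):
    """Return {domain name: cooling device name} for each RAPL domain.

    Assumes the layout shown in this section: one directory per domain,
    each holding a "thermal_cooling" symlink to its cooling device.
    """
    mapping = {}
    for domain in sorted(os.listdir(rapl_domains_dir)):
        link = os.path.join(rapl_domains_dir, domain, "thermal_cooling")
        if os.path.islink(link):
            # realpath follows the relative symlink; the last path
            # component is the cooling device name.
            mapping[domain] = os.path.basename(os.path.realpath(link))
    return mapping
```

On the example topology, this would be expected to map package to
cooling_device15, power_plane_0 to cooling_device16, and power_plane_1
to cooling_device18.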

2. per domain parameters

These are the fine tuning parameters only used by advanced
power/thermal management applications. Refer to Intel SDM ch14 for

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *

3. event notifications

The RAPL driver uses eventfd to provide userspace notifications on
selected events. A file node called "event_control" is created for each
RAPL domain. The user writes a control file descriptor, an eventfd
descriptor, and a threshold to the event_control file. The user
application can then use poll/select or a blocking read to get
notifications from the driver. Multiple events are allowed for each
domain, but only a single threshold is accepted.

4. Usage Examples (assume the topology in the sysfs layout above)

- set the power limit of the package domain (whole SoC package) to 6 W
root@chromoly:~# echo 6000 > /sys/class/thermal/cooling_device15/cur_state

- set the power limit of the pp1 domain (graphics) to 4 W
root@chromoly:~# echo 4000 > /sys/class/thermal/cooling_device18/cur_state

- check the current power usage in mW of the pp1 domain
root@chromoly:~# cat /sys/class/thermal/cooling_device18/cur_state

- set event notification when power consumption of graphics unit crosses
  event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens the control file "power" and creates an eventfd,
then writes efd, cfd, threshold to the event_control file of the given
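
The cur_state writes and reads in the first three examples can also be
wrapped in a small helper. The Python sketch below is illustrative only:
the function names are hypothetical, and the range check against
max_state is a client-side guard that assumes max_state is expressed in
mW like cur_state.

```python
import os


def watts_to_mw(watts):
    # cur_state takes milliwatts in the examples (6000 for 6 W).
    return int(round(watts * 1000))


def read_state(cdev_dir, name):
    """Read an integer attribute (cur_state, max_state) of a cooling device."""
    with open(os.path.join(cdev_dir, name)) as f:
        return int(f.read().strip())


def set_power_limit_mw(cdev_dir, limit_mw):
    """Write a power limit in mW to the cooling device's cur_state."""
    # ASSUMPTION: max_state is the largest accepted limit, in mW.
    max_mw = read_state(cdev_dir, "max_state")
    if not 0 <= limit_mw <= max_mw:
        raise ValueError(f"limit {limit_mw} mW outside 0..{max_mw} mW")
    with open(os.path.join(cdev_dir, "cur_state"), "w") as f:
        f.write(str(limit_mw))
```

For instance, set_power_limit_mw("/sys/class/thermal/cooling_device15",
watts_to_mw(6)) corresponds to the first example. Per the third example,
reading cur_state back on real hardware reports current power usage, not
the limit that was written.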


1. Package power limit events are supported by the legacy thermal
reporting mechanism, which uses the local APIC thermal vector to generate
interrupts when targeted P-states are not honored by the HW/FW. This is
tied to machine check reporting. Until RAPL is used, this notification is
a rare exception. When the RAPL power limit is set artificially low, this
notification could result in unwanted interrupts for each power limit
excursion. Therefore, the RAPL driver attempts to turn off the power
limit notification interrupt when the user sets a power limit.

2. According to the Intel Software Developer's Manual, the RAPL interface
can report max/min power for certain domains. In reality, however, HW
often reports 0 for max/min power. The RAPL driver works around this by
using the thermal specification power or the current power limit 1 when
max power information is not available. As a result, the max_state of a
RAPL cooling device can be based on thermal spec power or power limit 1.

3. RAPL is backed by FW, so in case of FW failure or plain lack of
support, setting a RAPL power limit could fail silently. I don't have a
good solution for that.

4. Data polling starts only when the following items are set
	- power limit
	- events
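
The max_state fallback described in note 2 can be illustrated with a
short Python sketch. It is not taken from the driver: the function and
parameter names are hypothetical, and trying thermal spec power before
power limit 1 is an assumption, since the note does not say which of the
two is preferred.

```python
def effective_max_power_mw(max_power_mw, thermal_spec_power_mw,
                           power_limit1_mw):
    """Pick the value used for max_state when HW reports 0 for max power."""
    # ASSUMPTION: thermal spec power is tried before power limit 1.
    for candidate in (max_power_mw, thermal_spec_power_mw, power_limit1_mw):
        if candidate:
            return candidate
    return 0  # nothing usable was reported
```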

Jacob Pan (1):
  Introduce Intel RAPL cooling device driver

 drivers/platform/x86/Kconfig      |    8 +
 drivers/platform/x86/Makefile     |    1 +
 drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
 drivers/platform/x86/intel_rapl.h |  249 +++++++
 4 files changed, 1581 insertions(+)
 create mode 100644 drivers/platform/x86/intel_rapl.c
 create mode 100644 drivers/platform/x86/intel_rapl.h

