TRANSCRIPT
Benchmarking of Dynamic Power Management Solutions
Frank Dols
CELF Embedded Linux Conference
Santa Clara, California (USA)
April 19, 2007
2 / 43
Why Benchmarking?
Vendor  NXP
What do we get?
From Here to There, 2000whatever
!!
3 / 43
Presentation Outline:
The context;
– Power management concepts.
– Hardware and software.
Benchmarks;
– The process.
– Metrics.
– Findings.
Conclusions.
– Relation to other work.
– What’s next.
4 / 43
Area Of Interest
Mobile/portable devices mostly consist of:
1. Battery.
2. Storage (flash disk, hard disk, …).
3. Display (LCD, …).
4. Speaker.
5. Broadband I/O (Bluetooth, UMTS, …).
6. Processing (CPU).
We are initially focusing only on optimizing the power consumption of “Processing”.
– Next, we will also take items 1–5 into account.
5 / 43
Power Measurements
A board under test, measured with LabView.
MP3 playback – LabView measurements.
Measurement equipment: high-precision DMMs (digital multimeters).
6 / 43
Energy Consumers
Energy saving methods trade performance or functionality for energy:
– Scaling performance of processors, memories and buses;
– Using various stand-by modes of peripherals.
Energy saving is about supplying the right amount of performance at the right time.
However, the future is unknown!
[Chart: energy-consuming components in a typical audio/video playing mobile device – CPU, memory, LCD, DC/DC conversion, other.]
7 / 43
Power Dissipation Basics
Dynamic power dissipation:

    E_{dyn} = \int_0^t C \, V_{DD}^2 \, f_c \, dt

– Reduce the capacitance switched.
– Reduce switching currents.
– Reduce the operating voltage.

Leakage power dissipation:

    E_{lkg} = \int_0^t V_{DD} \, I_{lkg} \, dt

– Reduce leakage current in active and standby modes of operation.
– Reduce the operating voltage.

Total power dissipation:

    E = \int_0^t \left( C \, V_{DD}^2 \, f_c + V_{DD} \, I_{lkg} \right) dt
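The total-dissipation formula above can be evaluated numerically for a constant operating point. A minimal sketch; the component values below (1 nF switched capacitance, 1.2 V, 300 MHz, 5 mA leakage) are illustrative assumptions, not figures from the slides:

```python
def total_energy(c, v_dd, f_c, i_lkg, seconds):
    """E = (C * V_DD^2 * f_c + V_DD * I_lkg) * t for a constant
    operating point: dynamic plus leakage dissipation."""
    p_dynamic = c * v_dd ** 2 * f_c   # switched-capacitance power
    p_leakage = v_dd * i_lkg          # leakage power
    return (p_dynamic + p_leakage) * seconds

# one second at an assumed operating point
energy_j = total_energy(1e-9, 1.2, 300e6, 5e-3, 1.0)
```

Note how the dynamic term scales with V² and f, while the leakage term only scales with V: the two bullets “reduce the operating voltage” above attack both terms at once.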
8 / 43
Dynamic Voltage Frequency Scaling (DVFS)
Scales performance according to demand (using an estimation of future
workload).
Based on the fact that energy per clock cycle rises with frequency:

    P \propto V_{DD}^2 \cdot f

Implemented by switching between operating points (voltage and frequency pairs).
[Diagram: a workload estimator, fed by applications, the task scheduler, or another source of information, drives the clock generator and voltage controller of the processor or memory.]
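Operating-point selection can be sketched as a lookup over (voltage, frequency) pairs: pick the lowest point whose frequency covers the predicted demand. The table below is a hypothetical example in the spirit of the two-PLL setup described later in the deck, not the platform's actual table:

```python
# assumed (voltage V, frequency Hz) pairs, sorted by frequency
OPERATING_POINTS = [(0.9, 100e6), (1.0, 200e6), (1.1, 300e6), (1.2, 400e6)]

def select_operating_point(predicted_cycles_per_s):
    """Pick the lowest (V, f) pair whose frequency covers the
    predicted workload; fall back to the fastest point."""
    for v, f in OPERATING_POINTS:
        if f >= predicted_cycles_per_s:
            return (v, f)
    return OPERATING_POINTS[-1]
```

Because P ∝ V²·f, running a light workload at the 0.9 V/100 MHz point instead of 1.2 V/400 MHz reduces dynamic power by far more than the 4× frequency ratio alone.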
9 / 43
DVFS: How Does it Save Power
[Charts of power vs. time: at 50% CPU usage without DVFS, the CPU alternates between full speed (high f, V) and idle at stand-by power; at 50% CPU usage with DVFS, it runs continuously at 50% speed (low f, V), which is optimal; a third chart shows 100% CPU usage. The shaded area is the energy used.]
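The 50%-usage comparison above can be checked against the P ∝ V²·f model: finish the work at full speed and idle, or stretch it over the whole period at half frequency and a lower voltage. The voltages and the 1 nF capacitance are assumptions for illustration:

```python
def dynamic_energy(v_dd, f, seconds, c=1e-9):
    """Dynamic energy C * V^2 * f * t (assumed C = 1 nF)."""
    return c * v_dd ** 2 * f * seconds

# 2e8 cycles of work per second = 50 % load on a 400 MHz CPU
race_to_idle = dynamic_energy(1.2, 400e6, 0.5)   # full speed, then idle
with_dvfs    = dynamic_energy(0.9, 200e6, 1.0)   # half speed, lower voltage
```

Both runs retire the same number of cycles, but the DVFS run spends them at a lower voltage, so its energy is lower; the saving comes entirely from the V² term.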
10 / 43
Performance Prediction Methods
Interval-based: use the CPU usage of the previous interval(s) to determine the frequency for the next interval.
[Chart: task/workload and operating frequency over time, 0–100%.]
Problem: late reaction to changing workloads → missed deadlines.
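An interval-based policy can be sketched in a few lines: assume the next interval looks like the last one and pick the lowest frequency that keeps the predicted utilisation under a headroom target. The frequency table and the 80% target are assumptions, not values from the slides:

```python
FREQS = [100e6, 200e6, 300e6, 400e6]  # assumed available frequencies (Hz)

def interval_policy(current_freq, last_usage, target=0.8):
    """Interval-based prediction: the cycles consumed in the last
    interval are assumed to repeat; pick the lowest frequency that
    keeps utilisation at or below `target`."""
    demand = last_usage * current_freq          # cycles/s actually consumed
    for f in FREQS:
        if demand / f <= target:
            return f
    return FREQS[-1]
```

The "late reaction" problem is visible in the structure: a workload spike is only seen after the interval in which it already missed its deadline.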
11 / 43
Performance Prediction Methods
Application-directed: use information from the application to change frequencies.
[Chart: task/workload and operating frequency over time, 0–100%.]
Problems: requires changes in the application; not always possible (interactive applications).
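Application-directed scaling means the application itself announces its upcoming need. A minimal sketch of what such instrumentation could look like; `set_required_performance` stands in for a DPM hint entry point, which is a hypothetical API (the slides do not specify the real interface):

```python
hints = []

def set_required_performance(fraction):
    """Stand-in for a DPM hint call (hypothetical API); clamps
    the requested fraction of maximum performance to [0, 1]."""
    hints.append(min(max(fraction, 0.0), 1.0))

def play_clip(frame_costs, cycles_per_frame_budget):
    """A streaming decoder knows each frame's cost ahead of time,
    so it can feed the requirement forward per frame."""
    for cost in frame_costs:
        set_required_performance(cost / cycles_per_frame_budget)
```

This is exactly why the interactive group is problematic: a browser or game has no `frame_costs` list to hand over, because the cost depends on future user input.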
12 / 43
Dynamic Power Management (DPM) Concept
DPM software connects to the OS kernel and collects data;
– with the collected data and a set of policies, it tries to predict the future workload.
Multiple policies categorize the software workload.
A single global prediction of the future performance is made.
[Diagram: applications feed the OS; the DPM software evaluates multiple policies to derive the required performance as a % of maximum, which is translated into volts and MHz.]
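One policy family used in the benchmarks later in this deck is the exponentially weighed average. A minimal sketch of the prediction step; `alpha` is an assumed tuning parameter, not a value from the slides:

```python
def ewa_prediction(usage_samples, alpha=0.5):
    """Exponentially weighed average over past CPU-usage samples
    (fractions in [0, 1]): recent intervals weigh more, but old
    history never fully disappears."""
    prediction = usage_samples[0]
    for sample in usage_samples[1:]:
        prediction = alpha * sample + (1.0 - alpha) * prediction
    return prediction
```

The smoothing is the trade-off the findings will show: a small alpha filters noise but reacts slowly to workload changes, a large alpha reacts fast but chases every spike.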
13 / 43
Application directed DVFS
Two main groups of mobile applications:
– Interactive (Internet browsing, gaming): future workload unknown (depends on user input).
– Streaming (audio decoding, video decoding): future workload known to the application.
14 / 43
Presentation Outline:
The context;
– Power management concepts.
– Hardware and software.
Benchmarks;
– The process.
– Metrics.
– Findings.
Conclusions.
– Relation to other work.
– What’s next.
15 / 43
Benchmark Process
Plan:
– Define Key Performance Indicators (KPIs).
– Define tests to measure the KPIs.
Do:
– Measure on the evaluation platform.
Check:
– Analyze the results and check whether they can be explained.
– When necessary, cross-verify with the vendor.
Act:
– Take the necessary actions on the results.
16 / 43
Key Performance Indicators (KPI)
Memory usage.
– Footprint of the DPM framework and its administration.
CPU usage.
– Cycles consumed by the DPM framework.
– System behavior, predictability & reproducibility.
System idleness.
– Prediction of the optimum frequency (minimization of idleness).
Real-time behavior.
– Application deadlines missed.
– Responsiveness (latency on events).
– DPM policy prediction accuracy.
17 / 43
Test / Use Cases
Synthetic (fine-grained) benchmarks:
– Simulate different workload levels,
– Test corner cases.
• LMbench as available from Open Source: lat_proc, lat_syscall, memory, clock, idle, …
Application-level (coarse-grained) benchmarks:
– Whetstone & Dhrystone (artificial system load).
– Hartstone (real-time behavior).
Video decoding:
– ffplay MPEG video decoding use case.
• uses ffplay, part of ffmpeg.
Audio decoding:
– MP3 audio decoding use case.
• uses the ARM MP3 decoding library.
18 / 43
Metrics for SW based Benchmarking
Missed deadlines;
Overdue time;
Idle time;
Jitter;
Responsiveness.
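The first four metrics can all be derived from per-task deadlines and completion times. A sketch of how such a summary might be computed; the deck illustrates the metrics graphically rather than defining formulas, so the exact definitions here (e.g. jitter as the spread of completion offsets) are assumptions:

```python
def dpm_metrics(deadlines, completions):
    """Summarise DPM metrics from absolute deadlines and completion
    times (seconds), pairwise per task."""
    offsets = [c - d for c, d in zip(completions, deadlines)]
    missed  = sum(o > 0 for o in offsets)        # missed deadlines
    overdue = sum(o for o in offsets if o > 0)   # total overdue time
    idle    = sum(-o for o in offsets if o < 0)  # slack before deadlines
    jitter  = max(offsets) - min(offsets)        # completion-time spread
    return missed, overdue, idle, jitter
```

Responsiveness (latency on events) needs event timestamps rather than deadlines, so it is measured separately.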
19 / 43
Missed Deadlines
When operating frequency is too low, deadlines are missed:
20 / 43
Overdue Time
Are missed deadlines caused by timing inaccuracy, or by a performance deficiency?
[Chart: power vs. time, with the overdue time marked.]
21 / 43
Idle Time
When the operating frequency is too high, extra slack time is introduced and power is wasted:
[Charts: power vs. time at a too-high frequency, compared with the optimal scenario.]
22 / 43
Jitter
Execution times vary, so the time of completion varies as well:
[Chart: power vs. time with arrival and completion times marked.]
Arrival times are periodic, but completion times vary → jitter.
23 / 43
In Summary, DPM Metrics
Missed deadlines (OS schedule);
Overdue time;
Idle time;
Jitter (fluctuation in Completion time);
Responsiveness.
[Chart: power vs. time, annotated with a missed deadline, overdue time, idle time and completion time.]
24 / 43
NXP’s DVFS Benchmark Platform
Energizer I SoC:
– ARM1176JZF-S,
– DVFS-enabled (core as well),
– CPU voltage can be set in 25 mV increments,
– CPU frequency can be set using two PLLs (300 MHz and 400 MHz by default) and a divider.
Energizer I software:
– Linux 2.6.15,
– CPUfreq and PowerOP,
– MV’s DPM framework ported but not used.
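Since the platform uses CPUfreq, its current operating point is visible through the standard sysfs interface. A minimal reader might look like this; the default path is the mainline CPUfreq location, and `base` is parameterised so the function can be pointed at test data instead of a live kernel:

```python
def read_cpufreq(attr, base="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Read a CPUfreq sysfs attribute, e.g. scaling_cur_freq
    (frequencies are reported in kHz)."""
    with open(base + "/" + attr) as f:
        return f.read().strip()
```

On the Energizer I board, `scaling_cur_freq` would report which of the PLL/divider frequencies is currently selected.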
25 / 43
CPU Usage Characteristics
[Three charts of performance (%) vs. time (s), each plotting the DPM load against the CPU usage: “No DPM”, “Exponential Weighed Average Policy”, and “Moving Average Policy”.]
The applied policy reacts slowly to the required performance (CPU usage).
26 / 43
DPM; Synthetic Workload
[Chart: performance vs. time for (1st) the synthetic workload, (2nd) the DPM-predicted performance requirement, and (3rd) the actual CPU usage, annotated with headroom and a late response (Exponential Weighed Average Policy).]
27 / 43
Application Level Benchmark
Video playback (DivX).
– Open-source ffplay using the frame-buffer device.
– Instrumented with time measurements.
Evaluation:
– How many hard deadlines are missed with & without DPM.
28 / 43
ffplay Deadlines
• ffplay deadline: time interval in between displaying of
successive (decoded) frames.
- For 10 fps movie -> the deadline is 100 ms.
- With DPM activated,
fluctuation increases.
- REMARK: video output
(display) on CompactPresen-
ter via Compact flash interface,
results in high fluctuation
oops !! (see next slide).
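The deadline definition above is straightforward to turn into a miss counter over the instrumented frame intervals, as a small sketch:

```python
def frame_deadline_us(fps):
    """Per-frame deadline in microseconds: the interval between
    displaying successive frames (100,000 us at 10 fps)."""
    return 1_000_000 / fps

def missed_frames(intervals_us, fps):
    """Count measured display intervals that exceeded the deadline."""
    limit = frame_deadline_us(fps)
    return sum(i > limit for i in intervals_us)
```

In the instrumented ffplay runs, this count is what is plotted against frame rate on slide 30.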
[Chart “ffplay@10fps”: per-frame display interval (µs), 0–200,000 µs over roughly 500 frames, with four annotated regions.]
29 / 43
Kernel & User Space Activity
While running ffplay, sample the CPU activity in both kernel and user space (based on /proc/stat).
Kernel activity dominates during the video playback.
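The split comes straight from the first fields of the `cpu` line in /proc/stat (user, nice, system, idle jiffies). A minimal parser for one sample; taking two samples and differencing the jiffy counts yields the usage percentages plotted below:

```python
def cpu_jiffies(stat_line):
    """Split a /proc/stat 'cpu' line into (user+nice, system, idle)
    jiffies from its first four value fields."""
    fields = stat_line.split()
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return user + nice, system, idle
```

The "kernel" curve in the chart corresponds to the system jiffies, the "user" curve to user+nice.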
[Chart “ffplay@10fps”: CPU usage (%) vs. time (µs), split into user and kernel activity.]
30 / 43
DPM; Video Decoding
[Chart: deadlines missed and idle time (%) vs. frames per second (5–25), with and without DPM (Exponential Weighed Average Policy). Annotations: power savings, performance impact, policy overhead, system overload.]
31 / 43
Video Decoding Jitter
[Chart: jitter (milliseconds, log scale) vs. frames per second (0–30), with and without DPM.]
Clock scaling causes a jitter increase.
The jitter metric gets polluted by a high number of deadline misses.
32 / 43
Video Decoding and Hints
[Chart “CITY_5fps.avi”: CPU frequency (MHz) vs. time (s) for No DPM, DPM without hints, and DPM with hints.]
33 / 43
[Chart: idle time (%) vs. frames per second (5–15): DPM disabled, custom DPM enabled without hints, custom DPM enabled with hints.]
DPM without hints is only effective at low workloads.
When application-directed hints are enabled:
• the idle time reduction increases,
• it is effective in a wider frame-rate window.
34 / 43
Presentation Outline:
The context;
– Power management concepts.
– Hardware and software.
Benchmarks;
– The process.
– Metrics.
– Findings.
Conclusions.
– Relation to other work.
– What’s next.
35 / 43
Technology Conclusions
Dynamic Power Management (DPM).
– The added value is in the policies.
– Performs best on low workloads.
– Adaptation to a changing workload depends heavily on the chosen policy.
– As an application developer, you have to develop your own policies.
– Be aware: a policy adds overhead.
Applications cannot feed data directly into DPM.
– Performance prediction based on logging application history does not always work.
– The best performance prediction may come from the application itself (feed forward)!
– Hinting can give improvements over an interval-based approach.
36 / 43
Conclusions
Interval-based DPM at CPU level results in higher CPU utilisation, but seems only really effective at low workloads.
Precise energy-saving figures should be measured with hardware measurement tools, but software benchmarks give good insight into the saving potential and the consequences for real-time behaviour.
The SoC is not the biggest power consumer.
– The backlight, DC/DC converters & power amplifiers are big consumers; by optimizing these, the most can be gained.
37 / 43
What’s Next
1. CELF (you guys), MIPI PM working group.
– Matt Locke and his team.
2. Mode switching.
3. Correlation between hardware and software;
– What to solve at which level?
4. Deadline-based scheduling: from priority-based to time-based.
5. Start contributing to the Community on this.
Regarding bullets 1, 2 & 3, see the extra slides.
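The move from priority-based to time-based scheduling in item 4 is commonly realised as earliest-deadline-first ordering; the deck names the direction, not a specific algorithm, so the following is only an illustrative sketch:

```python
def edf_order(tasks):
    """Order tasks earliest-deadline-first: run whichever task's
    deadline comes soonest, regardless of static priority."""
    return sorted(tasks, key=lambda t: t["deadline"])
```

For DPM, deadline knowledge is doubly useful: the scheduler knows when each task must finish, so it also knows how much the clock can be lowered without missing a deadline.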
38 / 43
Overview of Open Source Power Mgt. Software
[Diagram mapping power-management tasks to open-source components: CPU power management (stand-by/low-power mode when idle; dynamic voltage/frequency scaling) and device power management (suspend/resume of devices; (memory) bus frequency scaling; suspend of the complete system), covered by CPUfreq, the task scheduler, APM/ACPI, PowerOP and the Linux Device Model.]
(Also look at Mark Gross’ presentation.)
39 / 43
Mode Transitions
[Diagram: ACTIVE mode → transition → LP mode → transition → ACTIVE mode. App and PM policy (performance monitoring, decision taking) run in ACTIVE mode. Entering LP mode: slp_sw – mode transition in the SW domain (state save, sequencing, PM infra control), then slp_hw – mode transition in the HW domain (sequencing, communication). On an event, waking up: wu_hw – mode transition in the HW domain (communication, voltage/frequency settling, sequencing), then wu_sw – mode transition in the SW domain (warm boot, state restore, sequencing).]
40 / 43
Improving power management through co-design of hardware and software.
Other ideas for improving power management:
– Performance counters for other peripherals:
• Cache miss rate is an indicator for the memory bus usage of the CPU.
– Hardware-aided workload prediction algorithms (in an embedded FPGA):
• DVFS algorithm in the FPGA instead of in software;
• Enables fast calculation of estimations and quick operating point changes;
• The algorithm can be changed for every use case thanks to the use of an FPGA.
– More hardware acceleration for specific applications:
• Needs discussion with SW developers at an early stage about which parts can be accelerated;
• Enables running at a lower frequency, and as such saving energy.
[Diagram: applications and the OS on the software side; on the hardware side an FPGA hosting the workload prediction algorithm and policy manager, driving the DVFS hardware (clock generator, voltage controller).]
41 / 43
To come to the End.
42 / 43
From Here to There, 2000whatever
We started with the question: “Why Benchmarking?”
Do your homework,
don’t bring in the Trojan Horse!
43 / 43
44 / 43
NXP Semiconductors
Established in 2006(formerly a division of Philips)
Builds on a heritage of 50+ years of experience in semiconductors
Provides engineers and designers with semiconductors and software that deliver better sensory experiences
Top-10 supplier with sales of €4.96 billion (2006)
Sales: 35% Greater China, 31% Rest of Asia, 25% Europe, 9% North America
Headquarters: Eindhoven, The Netherlands
Key focus areas:
– Mobile & Personal, Home, Automotive & Identification, Multimarket Semiconductors
Owner of NXP Software: a fully independent software solutions company
45 / 43
Where to Find Us
Operating systems Knowledge Centre (OKC)
phone: +31 40 - 27 45 131
fax: +31 40 - 27 43 630
email: [email protected]
Web: www.nxp.com
address: High Tech Campus 46
location 3.31
5656 AE Eindhoven (NL)
46 / 43