TRANSCRIPT
DyPO: Dynamic Pareto-Optimal Configuration Selection for
Heterogeneous MpSoCs
School of Electrical, Computer, and Energy Engineering
Ujjwal Gupta, Chetan A. Patil, Ganapati Bhat*, Umit Y. Ogras (Arizona State University, Tempe, AZ, USA)
Prabhat Mishra (University of Florida, Gainesville, FL, USA)
Oct 16, 2017 at ESWEEK
2
Outline
§ Motivation
§ Related Work
§ Dynamic Pareto-Optimal (DyPO) Configuration Selection
– Overview and Illustrative Example
– Details of the DyPO Framework
§ Experimental Results
§ Conclusion
3
Motivation
§ Mobile phones have a highly heterogeneous processor
– Example: big.LITTLE architecture
§ Consequently, a large number of configuration knobs
– Frequency, number of cores, types of cores (4004 configurations in big.LITTLE!)
§ Power and performance change
– As a function of the configuration,
– And of the workload
§ Are the configurations optimal for other metrics, such as energy?
[Figure: Samsung Exynos 5422 with a big cluster (Cores 0-3) and a LITTLE cluster (Cores 0-3); power-performance plot for the Blackscholes application with 2 threads (2T)]
4
Challenges for Optimization
§ A good power-performance tradeoff does not mean a good energy-performance tradeoff
§ Dynamically selecting the optimal configuration is challenging
– The large design space prohibits runtime exploration
– The Basicmath application has 3 phases; exhaustive search would require going through 4004³ ≈ 6×10¹⁰ different possibilities!
– The workload changes at runtime
[Figure: three Basicmath phases annotated with example configurations {3L, 1.4 GHz, 2B, 2 GHz}, {1L, 0.6 GHz, 2B, 0.6 GHz}, and {4L, 0.8 GHz, 3B, 0.8 GHz}]
5
Related Research
§ Recently, a number of papers have applied DPM and DVFS together
§ [Aalsaud et al. 2016]
– Uses separate power and performance models for each frequency and application on the platform
– Chooses the best performance per watt at runtime using the models
– High overhead of learning models for a new application at runtime (not scalable)
§ [Cochran et al. 2011]
– Finds the optimal frequency and thread packing for a workload in homogeneous systems, maximizing performance under a power budget
– Does not use phase-level instrumentation
§ [Donyanavard et al. 2016]
– Binning-based classification for identifying the degree of memory- and compute-boundedness of the tasks
– Task allocation to CPU cores based on predicted power/performance, minimizing power consumption under a throughput constraint
§ Our technique performs phase-level DVFS and DPM together to optimize any given multi-objective goal in heterogeneous systems
6
DyPO Overview and Illustration
§ Four steps:
1. Instrumentation
2. Characterization
3. Classification
4. Online selection
[Figure: the DyPO flow.
Step 1 (Instrumentation, offline): LLVM/Clang instrumentation inserts BB_PAPI_read() calls into multiple applications (A1, A2, A3, …) at basic-block boundaries (BB1, BB2, …, BB1M, BB(1M+1), BB(1M+2), …, BB2M).
Step 2 (Characterization, offline): the input features x, e.g. x₁ = #instructions-retired and x₂ = #LLC-misses, are profiled per phase (Phase-1: x₁ = 10M, x₂ = 10k; Phase-2: x₁ = 10M, x₂ = 1k; Phase-3: x₁ = 10M, x₂ = 9k), and the optimal configuration c_k* is found for each phase, e.g. {2L, 1 GHz, 3B, 1 GHz} or {4L, 2 GHz, 4B, 2 GHz}.
Step 3 (Classification, offline): find the boundary f(p_k) = c_k*, i.e. c_k* ≈ f(#LLC-misses, #instructions-retired).
Step 4 (Online selection): for a new application, a classifier evaluates C_next = f(p_k) for each new phase; e.g. Phase-3 is similar to Phase-1, so it reuses Phase-1's optimal configuration {2L, 1 GHz, 3B, 1 GHz}.]
7
Step 1: Workload Instrumentation
§ We use the PAPI tool to obtain
– Application- and system-level parameters
§ Sampling the counters at uniform time intervals is not workload-conservative
– The workload in each time interval will be different for each configuration
§ The instruction count should not change with the configuration
[Figure: instructions (×10⁶) vs. time for the Basicmath app sampled at 50 ms, at f = 0.6 GHz and f = 2 GHz]
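The contrast above can be illustrated numerically. The sketch below uses made-up instruction rates (the IPC value and snippet size are assumptions, not measurements from the paper): a fixed 50 ms sampling interval captures a different amount of work at each frequency, while a basic-block snippet denotes the same work at any frequency.

```python
# Made-up rates: fixed-time sampling captures different work per sample at
# different frequencies, while a basic-block snippet is frequency-invariant.
def instructions_per_interval(freq_ghz, ipc, interval_s=0.05):
    """Instructions executed in one fixed time interval at a given frequency."""
    return freq_ghz * 1e9 * ipc * interval_s

slow = instructions_per_interval(0.6, ipc=1.0)   # ~30M instructions per 50 ms
fast = instructions_per_interval(2.0, ipc=1.0)   # ~100M instructions per 50 ms
print(slow, fast)                                # time-based samples differ

snippet = 10_000_000                             # a snippet is a fixed BB count,
print(snippet, snippet)                          # identical at every frequency
```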
8
Basic Block Level Instrumentation
§ Group a sequence of basic blocks into distinct snippets
§ Each snippet, or a sequence of snippets, may form a workload phase
§ BB_PAPI_read() is our custom API to read application- and system-level parameters
[Reminder figure: LLVM/Clang instrumentation inserts BB_PAPI_read() calls into multiple applications (A1, A2, A3, …) at basic-block boundaries, from Start: BB1, BB2, … through BB100M: End, marking Phase-1 and Phase-2]
9
Instrumentation Process
§ LLVM+Clang is used to generate instrumented binaries at basic-block granularity
§ The average overhead (number of extra instructions added) for 18 applications is only 1%
[Figure: inputs are the benchmark (.c/.cpp), the PAPI instrumentation pass using LLVM, and a custom library (.so); the Clang compiler driven by CMake/Make produces the output, an instrumented benchmark (.o)]
10
Step 2: Offline Characterization
§ We run 3 iterations of each benchmark at different frequency and core configurations
§ The selection includes
– All core configurations from 1L+1B to 4L+4B
– All frequencies in steps of 0.2 GHz from 0.6 to 2 GHz
– Leads to 128 configurations
§ We profile 18 benchmarks
– Leads to 4467 distinct workload snippets
§ The time spent for one benchmark is about 1-2 hours
[Figure: nested sweeps per benchmark: sweep big cores (n_B), sweep little cores (n_L), sweep frequencies (n_f), repeat benchmark (n_iter), giving n_B × n_L × n_f × n_iter configuration data points]
Notation: n_B = number of big cores, n_L = number of little cores, n_f = number of frequencies, n_iter = number of iterations, f_L = little-core frequency, f_B = big-core frequency
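The 128-configuration count above follows directly from the sweep ranges. A minimal sketch of the enumeration (the tuple layout is an illustrative choice, not the authors' data format):

```python
# Enumerate the characterization sweep: 1-4 cores per cluster and
# 0.6-2.0 GHz in 0.2 GHz steps, as stated on the slide.
from itertools import product

def configurations():
    """Every (n_little, n_big, freq_ghz) point of the sweep."""
    core_counts = range(1, 5)                            # 1L+1B up to 4L+4B
    freqs = [round(0.6 + 0.2 * i, 1) for i in range(8)]  # 0.6 ... 2.0 GHz
    return [(nl, nb, f) for nl, nb, f in product(core_counts, core_counts, freqs)]

configs = configurations()
print(len(configs))  # 4 x 4 core combinations x 8 frequencies = 128
```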
11
Problem Formulation
§ Set of all possible configurations: set C
– The configuration at sample k is a tuple c_k = (n_L,k, f_L,k, n_B,k, f_B,k) ∈ C
§ Set of all phases in an application: set P
– The phase at sample k is p_k ∈ P
Goal: Find f : P ∋ p_k ↦ c_k* ∈ C
where c_k* ∈ C is the optimal configuration for workload phase p_k ∈ P
12
Step 3: Optimal Configuration Classification
§ Energy consumption and responsiveness are of primary importance in mobile systems
§ Consider a bi-objective optimization problem:
J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ]
§ We can also form any multi-objective goal using E, t_exe, IPS, and P
§ The relative weight μ ≥ 0 determines the importance of the objectives
– When μ ≈ 0, the optimization becomes DyPO-Energy
– A large μ implies performance optimization → DyPO-Performance
[Figure: energy-performance Pareto frontier]
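The bi-objective selection J(p_k) reduces to an argmin over the characterized configurations of a phase. A minimal sketch on made-up measurements (the configuration tuples and the energy/time numbers are illustrative only):

```python
# Pick the configuration minimizing E + mu * t_exe for one phase,
# using fabricated characterization numbers.
def optimal_config(measurements, mu):
    """measurements: {config: (energy_J, exec_time_s)}; returns the argmin config."""
    return min(measurements, key=lambda c: measurements[c][0] + mu * measurements[c][1])

phase = {
    ("2L", "1GHz", "3B", "1GHz"): (8.7, 3.0),   # low energy, slower
    ("4L", "2GHz", "4B", "2GHz"): (12.0, 1.5),  # high energy, faster
}

print(optimal_config(phase, mu=0))  # mu ~ 0: DyPO-Energy picks the 8.7 J config
print(optimal_config(phase, mu=5))  # large mu: DyPO-Performance picks the fast config
```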
13
Classifier: Inputs and Outputs
§ Each snippet p_k has feature data x_k associated with it
§ Here c_k* is the supervisory response
§ Supervised machine learning techniques can be used to find f
[Figure: characterization data (H/W counters, CPU stats) → solve for J(p_k) → feature data (x_k) and optimal config (c_k*), with J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ] and the goal Find f : P ∋ p_k ↦ c_k* ∈ C]
Feature data (x_k):
– 1 (bias term)
– #CPU cycles / #instructions
– #branch mispredictions / #instructions
– #LLC misses / #instructions
– #data memory accesses / #instructions
– #non-cache external memory requests / #instructions
– Σ(little core utilizations)
– Sorted big core utilizations
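Assembling the feature vector from raw counters can be sketched as follows. The counter names, the example values, and the descending sort order of the big-core utilizations are assumptions for illustration; real values would come from BB_PAPI_read() and CPU statistics.

```python
# Build the feature vector x_k listed above from raw per-snippet counters.
# Counter names and numbers are illustrative placeholders.
def feature_vector(counters, little_util, big_util):
    instr = counters["instructions"]
    x = [
        1.0,                                    # bias term
        counters["cpu_cycles"] / instr,
        counters["branch_mispredictions"] / instr,
        counters["llc_misses"] / instr,
        counters["data_memory_accesses"] / instr,
        counters["noncache_ext_mem_requests"] / instr,
        sum(little_util),                       # total little-cluster utilization
    ]
    x += sorted(big_util, reverse=True)         # sorted big-core utilizations (order assumed)
    return x

x_k = feature_vector(
    {"instructions": 10_000_000, "cpu_cycles": 15_000_000,
     "branch_mispredictions": 50_000, "llc_misses": 9_000,
     "data_memory_accesses": 3_000_000, "noncache_ext_mem_requests": 12_000},
    little_util=[0.2, 0.1, 0.0, 0.0], big_util=[0.4, 0.9, 0.0, 0.0])
print(len(x_k))  # 11 entries: 7 scalar features + 4 sorted big-core utilizations
```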
14
Offline Learning of the Classifier
§ We use multinomial logistic regression
– To classify different phases and output the optimal configurations
§ The likelihood function is maximized at design time
– To obtain the model coefficient matrix β
Likelihood: l(β) = ∏_{k=1}^{M} Pr(C = c_k* | X = x_k)
Conditional probability: Pr(C = c_k* | X = x_k) = e^{β·x_k} / (1 + e^{β·x_k})
M = #workload snippets × #configs
[Figure: characterization data (H/W counters, CPU stats) → solve for J(p_k) → feature data (x_k) and optimal config (c_k*) → training: solve l(β) offline to get the model coefficients β]
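The training step can be sketched end to end. This is not the authors' implementation: the sketch uses the softmax form of multinomial logistic regression (the multiclass generalization of the logistic expression on the slide), fit by gradient ascent on the log-likelihood, and the toy features (bias, cycles/instr, LLC-misses/instr) and two configuration classes are made up.

```python
# Multinomial logistic regression fit by gradient ascent on the
# log-likelihood l(beta), on a tiny fabricated dataset.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(X, y, n_classes, lr=0.5, epochs=200):
    """Return the coefficient matrix beta (one row per configuration class)."""
    d = len(X[0])
    beta = [[0.0] * d for _ in range(n_classes)]
    for _ in range(epochs):
        for x, c in zip(X, y):
            probs = softmax([sum(b * f for b, f in zip(row, x)) for row in beta])
            for i in range(n_classes):
                grad = (1.0 if i == c else 0.0) - probs[i]  # d log-likelihood / d beta_i
                for j in range(d):
                    beta[i][j] += lr * grad * x[j]
    return beta

def predict(beta, x):
    return max(range(len(beta)), key=lambda i: sum(b * f for b, f in zip(beta[i], x)))

# Compute-bound snippets (few LLC misses) -> class 0; memory-bound -> class 1
X = [[1.0, 1.2, 0.001], [1.0, 1.1, 0.002], [1.0, 2.5, 0.09], [1.0, 2.8, 0.08]]
y = [0, 0, 1, 1]
beta = train(X, y, n_classes=2)
print([predict(beta, x) for x in X])  # recovers the training labels: [0, 0, 1, 1]
```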
15
Design of Classifier
§ Two different feature vectors x_k can represent the same phase
– Hence they should map to the same optimal configuration c_k*
§ An arbitrary assignment of c_k* to x_k leads to a poor mapping function f
§ Therefore, we perform k-means clustering to find the natural clusters in the dataset
– This provides distinct phases out of all the workload snippets
– Then, we map the most common optimal configuration to each of the phases
[Figure: the training flow of the previous slide, with a k-means clustering stage inserted between the feature data (x_k) / optimal configs (c_k*) and the training step that solves l(β) offline for the model coefficients β]
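The clustering step above can be sketched as follows, under assumed details: a one-dimensional k-means on a single feature (LLC-misses per instruction) groups snippets into phases, and each cluster is then labeled with the most common optimal configuration among its members. The feature values and configuration names are fabricated.

```python
# 1-D k-means over snippet features, then majority-vote the optimal
# configuration within each cluster (phase).
from collections import Counter

def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalars; returns (centroids, cluster index per value)."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assign

# LLC-misses/instruction per snippet, with each snippet's offline optimal config
features = [0.001, 0.002, 0.003, 0.090, 0.085, 0.080]
optimal = ["4B-2GHz", "4B-2GHz", "2L-1GHz", "2L-1GHz", "2L-1GHz", "2L-1GHz"]

_, assign = kmeans_1d(features, k=2)
phase_config = {c: Counter(o for o, a in zip(optimal, assign) if a == c).most_common(1)[0][0]
                for c in set(assign)}
print(phase_config)  # each phase mapped to its most common optimal configuration
```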
16
Summary of Classifier Design
§ Offline: characterization data (H/W counters, CPU stats) → solve for J(p_k) → k-means clustering of the feature data (x_k) and optimal configs (c_k*) → training: solve l(β) offline to get the model coefficients β → store β
§ At runtime: read the new feature data (x_k) and the stored β, then find Pr(C = c_k* | X = x_k) to solve for c_k*
Objective: J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ]
Likelihood: l(β) = ∏_{k=1}^{M} Pr(C = c_k* | X = x_k)
Conditional probability: Pr(C = c_k* | X = x_k) = e^{β·x_k} / (1 + e^{β·x_k})
17
Step 4: Online Selection
§ Use the model coefficient matrix β stored on the platform (~282 bytes)
§ The conditional probability is computed at runtime for each of the N classes:
Pr(C = c_i* | X = x_k) = e^{β_i·x_k} / (1 + e^{β_i·x_k}) for each class i ∈ [1, N]
§ The class with the highest probability, argmax_i Pr(C = c_i* | X = x_k), is assigned to the system
§ Overhead = 20 μs
[Figure: read feature data (x_k) → compute the conditional probability for each class → find the most likely class i]
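The runtime step can be sketched in a few lines: score each of the N configuration classes with the stored coefficients and pick the most likely one. The coefficient matrix and feature vector below are made-up numbers, and the softmax scoring is the multiclass form of the slide's logistic expression.

```python
# Score each configuration class with the stored beta and select the
# most probable one for the new feature vector x_k (fabricated values).
import math

def select_config(beta, x):
    """Return (class index, probability) of the most likely configuration."""
    scores = [sum(b * f for b, f in zip(row, x)) for row in beta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

beta = [[0.4, -2.0, -8.0],   # class 0: compute-bound -> performance config
        [-0.4, 2.0, 8.0]]    # class 1: memory-bound  -> energy config
x_k = [1.0, 1.4, 0.05]       # bias, cycles/instr, LLC-misses/instr

cls, p = select_config(beta, x_k)
print(cls, round(p, 3))
```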
18
Outline
§ Motivation
§ Related Work
§ Dynamic Pareto-Optimal (DyPO) Configuration Selection
– Overview and Illustrative Example
– Details of the DyPO Framework
§ Experimental Results
§ Conclusion
19
Experimental Setup
§ Odroid-XU3 board
– Exynos 5422 octa-core CPU
§ Validated using 18 apps running on Ubuntu with Linux kernel v3.10
– MiBench suite
– Cortex suite
– PARSEC suite
[Figure: system stack: in user space, the benchmark with PAPI calls and DyPO; in kernel space, the sysfs interface, perf driver, and CPU governor exposing power, frequency, core configuration, and utilization from the PMU; output data is logged for analysis]
20
Classification Accuracy
§ We randomly partition the data into a 60% training-validation set and a 40% test set, leading to an average accuracy of
– Level-1 classifier: 99.9%
– Level-2 classifier: 92.7%
§ 5-fold cross-validation shows an average accuracy of
– Level-1 classifier: 99.9%
– Level-2 classifier: 80.5%
[Figure: per-benchmark accuracy (%) of the Level-1 and Level-2 classifiers for Basicmath, FFT, Qsort, SHA, Blowfish, Kmeans, PCA, Blackscholes-2T, Blackscholes-4T, Fluidanimate-2T, Fluidanimate-4T, and the average, plus a diagram of the two-level classifier hierarchy over the optimal configuration classes c_1*, c_2*, …]
21
Running Single-threaded Apps
§ DyPO-Energy achieves the minimum-energy goal
§ The interactive and powersave governors are away from the Pareto frontier
§ Ondemand is near the highest point of the Pareto frontier
[Figure annotation: 8.7% energy savings, 30% better performance]
22
Running Single-threaded Apps
We observe a similar trend across all 14 single-threaded benchmarks
23
Running Multi-threaded Apps
§ DyPO-Energy achieves the minimum-energy goal by staying near the lowest point of the Pareto frontier
§ For Fluidanimate-2T, DyPO-Energy provides a better result than the Pareto frontier
24
Running Multiple Applications
Minimum energy: 12 J ≈ 8.7 J + 3 J
The DyPO framework can perform optimization for multiple applications, using the hardware counters
25
Comparing Energy Consumption
§ 49%, 45%, and 6% energy savings compared to the interactive, ondemand, and powersave governors, respectively
26
Comparing Performance
§ While achieving the minimum energy, 24% better performance compared to the powersave governor
27
Comparison with Aalsaud et al. 2016
§ [Aalsaud et al. 2016] first derives power and performance models using linear regression for each frequency and application
§ It then uses these models to determine the optimal performance-per-watt (PPW) configuration at runtime
[Figure: normalized PPW of DyPO-Energy, Aalsaud-Offline, Aalsaud-ADA, and ondemand across Blowfish, Spectral, AES, Kmeans, Fluidanimate-2T, ADPCM, PCA, MotionEstimation, Basicmath, Blackscholes-4T, Blackscholes-2T, Patricia, Qsort, Fluidanimate-4T, Dijkstra, FFT, SHA, StringSearch, and the average]
DyPO-Energy shows 25% and 55% gains in PPW compared to Aalsaud-Offline and Aalsaud-ADA
28
Conclusions
§ Our technique successfully finds the Pareto-optimal configurations at runtime as a function of the workload
§ Experiments show an average increase of
– 93%, 81%, and 6% in performance per watt (PPW) compared to the interactive, ondemand, and powersave governors
Thank you!