TRANSCRIPT
DyPO: Dynamic Pareto-Optimal Configuration Selection for
Heterogeneous MpSoCs
School of Electrical, Computer, and Energy Engineering
Ujjwal Gupta, Chetan A. Patil, Ganapati Bhat*, Umit Y. Ogras (Arizona State University, Tempe, AZ, USA)
Prabhat Mishra (University of Florida, Gainesville, FL, USA)
Oct 16, 2017 at ESWEEK
2
Outline
§ Motivation
§ Related Work
§ Dynamic Pareto-Optimal (DyPO) Configuration Selection
– Overview and Illustrative Example
– Details of the DyPO Framework
§ Experimental Results
§ Conclusion
3
Motivation
§ Mobile phones have a highly heterogeneous processor
– Example: big.LITTLE architecture
§ Consequently, a large number of configuration knobs
– Frequency, number of cores, types of cores (4004 configurations in big.LITTLE!)
§ Power and performance change
– As a function of the configuration,
– And of the workload
§ Are the configurations optimal for other metrics, such as energy?
[Figure: Samsung Exynos 5422 with a big cluster (Cores 0-3) and a LITTLE cluster (Cores 0-3); power-performance plot for the Blackscholes application with 2 threads (2T)]
4
Challenges for Optimization
§ A good power-performance tradeoff does not mean a good energy-performance tradeoff
§ Dynamically selecting the optimal configuration is challenging
– The large design space prohibits runtime exploration
– The Basicmath application has 3 phases; exhaustive search would require going through 4004³ ≈ 6×10¹⁰ different possibilities!
– The workload changes at runtime
[Figure: three Basicmath phases annotated with example configurations {3L, 1.4 GHz, 2B, 2 GHz}, {1L, 0.6 GHz, 2B, 0.6 GHz}, and {4L, 0.8 GHz, 3B, 0.8 GHz}]
5
Related Research
§ Recently, a number of papers have applied DPM and DVFS together
§ [Aalsaud et al. 2016]
– Uses separate power and performance models for each frequency and application on the platform
– Chooses the best performance per watt at runtime using the models
– High overhead of learning models for a new application at runtime (not scalable)
§ [Cochran et al. 2011]
– Finds the optimal frequency and thread packing for a workload in homogeneous systems, maximizing performance under a power budget
– Does not use phase-level instrumentation
§ [Donyanavard et al. 2016]
– Binning-based classification for identifying the degree of memory- and compute-boundedness of the tasks
– Task allocation to CPU cores based on predicted power/performance, minimizing power consumption under a throughput constraint
§ Our technique performs phase-level DVFS and DPM together to optimize any given multi-objective goal in heterogeneous systems
6
DyPO Overview and Illustration
§ Four steps:
1. Instrumentation
2. Characterization
3. Classification
4. Online selection
[Figure: the DyPO flow.
Step 1 (Instrumentation, offline): LLVM/Clang instrumentation inserts BB_PAPI_read() calls into multiple applications (A1, A2, A3, …) at basic-block boundaries (BB1, BB2, …, BB1M, BB(1M+1), BB(1M+2), …, BB2M).
Step 2 (Characterization, offline): the input features x, e.g. x₁ = #instructions-retired and x₂ = #LLC-misses, are profiled per phase (Phase-1: x₁ = 10M, x₂ = 10k; Phase-2: x₁ = 10M, x₂ = 1k; Phase-3: x₁ = 10M, x₂ = 9k), and the optimal configuration c_k* is found for each phase, e.g. {2L, 1 GHz, 3B, 1 GHz} or {4L, 2 GHz, 4B, 2 GHz}.
Step 3 (Classification, offline): find the boundary f(p_k) = c_k*, i.e. c_k* ≈ f(#LLC-misses, #instructions-retired).
Step 4 (Online selection): for a new application, a classifier evaluates C_next = f(p_k) for each new phase; e.g. Phase-3 is similar to Phase-1, so it reuses Phase-1's optimal configuration {2L, 1 GHz, 3B, 1 GHz}.]
7
Step 1: Workload Instrumentation
§ We use the PAPI tool to obtain
– Application- and system-level parameters
§ Sampling the counters at uniform time intervals is not workload-conservative
– The workload in each time interval will be different for each configuration
§ The instruction count should not change with the configuration
[Figure: instructions (×10⁶) vs. time for the Basicmath app sampled at 50 ms, at f = 0.6 GHz and f = 2 GHz]
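The contrast above can be illustrated numerically. The sketch below uses made-up instruction rates (the IPC value and snippet size are assumptions, not measurements from the paper): a fixed 50 ms sampling interval captures a different amount of work at each frequency, while a basic-block snippet denotes the same work at any frequency.

```python
# Made-up rates: fixed-time sampling captures different work per sample at
# different frequencies, while a basic-block snippet is frequency-invariant.
def instructions_per_interval(freq_ghz, ipc, interval_s=0.05):
    """Instructions executed in one fixed time interval at a given frequency."""
    return freq_ghz * 1e9 * ipc * interval_s

slow = instructions_per_interval(0.6, ipc=1.0)   # ~30M instructions per 50 ms
fast = instructions_per_interval(2.0, ipc=1.0)   # ~100M instructions per 50 ms
print(slow, fast)                                # time-based samples differ

snippet = 10_000_000                             # a snippet is a fixed BB count,
print(snippet, snippet)                          # identical at every frequency
```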
8
Basic Block Level Instrumentation
§ Group a sequence of basic blocks into distinct snippets
§ Each snippet, or a sequence of snippets, may form a workload phase
§ BB_PAPI_read() is our custom API to read application- and system-level parameters
[Reminder figure: LLVM/Clang instrumentation inserts BB_PAPI_read() calls into multiple applications (A1, A2, A3, …) at basic-block boundaries, from Start: BB1, BB2, … through BB100M: End, marking Phase-1 and Phase-2]
9
Instrumentation Process
§ LLVM+Clang is used to generate instrumented binaries at basic-block granularity
§ The average overhead (number of extra instructions added) for 18 applications is only 1%
[Figure: inputs are the benchmark (.c/.cpp), the PAPI instrumentation pass using LLVM, and a custom library (.so); the Clang compiler driven by CMake/Make produces the output, an instrumented benchmark (.o)]
10
Step 2: Offline Characterization
§ We run 3 iterations of each benchmark at different frequency and core configurations
§ The selection includes
– All core configurations from 1L+1B to 4L+4B
– All frequencies in steps of 0.2 GHz from 0.6 to 2 GHz
– Leads to 128 configurations
§ We profile 18 benchmarks
– Leads to 4467 distinct workload snippets
§ The time spent for one benchmark is about 1-2 hours
[Figure: nested sweeps per benchmark: sweep big cores (n_B), sweep little cores (n_L), sweep frequencies (n_f), repeat benchmark (n_iter), giving n_B × n_L × n_f × n_iter configuration data points]
Notation: n_B = number of big cores, n_L = number of little cores, n_f = number of frequencies, n_iter = number of iterations, f_L = little-core frequency, f_B = big-core frequency
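The 128-configuration count above follows directly from the sweep ranges. A minimal sketch of the enumeration (the tuple layout is an illustrative choice, not the authors' data format):

```python
# Enumerate the characterization sweep: 1-4 cores per cluster and
# 0.6-2.0 GHz in 0.2 GHz steps, as stated on the slide.
from itertools import product

def configurations():
    """Every (n_little, n_big, freq_ghz) point of the sweep."""
    core_counts = range(1, 5)                            # 1L+1B up to 4L+4B
    freqs = [round(0.6 + 0.2 * i, 1) for i in range(8)]  # 0.6 ... 2.0 GHz
    return [(nl, nb, f) for nl, nb, f in product(core_counts, core_counts, freqs)]

configs = configurations()
print(len(configs))  # 4 x 4 core combinations x 8 frequencies = 128
```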
11
Problem Formulation
§ Set of all possible configurations: set C
– The configuration at sample k is a tuple c_k = (n_L,k, f_L,k, n_B,k, f_B,k) ∈ C
§ Set of all phases in an application: set P
– The phase at sample k is p_k ∈ P
Goal: Find f : P ∋ p_k ↦ c_k* ∈ C
where c_k* ∈ C is the optimal configuration for workload phase p_k ∈ P
12
Step 3: Optimal Configuration Classification
§ Energy consumption and responsiveness are of primary importance in mobile systems
§ Consider a bi-objective optimization problem:
J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ]
§ We can also form any multi-objective goal using E, t_exe, IPS, and P
§ The relative weight μ ≥ 0 determines the importance of the objectives
– When μ ≈ 0, the optimization becomes DyPO-Energy
– A large μ implies performance optimization → DyPO-Performance
[Figure: energy-performance Pareto frontier]
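The bi-objective selection J(p_k) reduces to an argmin over the characterized configurations of a phase. A minimal sketch on made-up measurements (the configuration tuples and the energy/time numbers are illustrative only):

```python
# Pick the configuration minimizing E + mu * t_exe for one phase,
# using fabricated characterization numbers.
def optimal_config(measurements, mu):
    """measurements: {config: (energy_J, exec_time_s)}; returns the argmin config."""
    return min(measurements, key=lambda c: measurements[c][0] + mu * measurements[c][1])

phase = {
    ("2L", "1GHz", "3B", "1GHz"): (8.7, 3.0),   # low energy, slower
    ("4L", "2GHz", "4B", "2GHz"): (12.0, 1.5),  # high energy, faster
}

print(optimal_config(phase, mu=0))  # mu ~ 0: DyPO-Energy picks the 8.7 J config
print(optimal_config(phase, mu=5))  # large mu: DyPO-Performance picks the fast config
```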
13
Classifier: Inputs and Outputs
§ Each snippet p_k has feature data x_k associated with it
§ Here c_k* is the supervisory response
§ Supervised machine learning techniques can be used to find f
[Figure: characterization data (H/W counters, CPU stats) → solve for J(p_k) → feature data (x_k) and optimal config (c_k*), with J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ] and the goal Find f : P ∋ p_k ↦ c_k* ∈ C]
Feature data (x_k):
– 1 (bias term)
– #CPU cycles / #instructions
– #branch mispredictions / #instructions
– #LLC misses / #instructions
– #data memory accesses / #instructions
– #non-cache external memory requests / #instructions
– Σ(little core utilizations)
– Sorted big core utilizations
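Assembling the feature vector from raw counters can be sketched as follows. The counter names, the example values, and the descending sort order of the big-core utilizations are assumptions for illustration; real values would come from BB_PAPI_read() and CPU statistics.

```python
# Build the feature vector x_k listed above from raw per-snippet counters.
# Counter names and numbers are illustrative placeholders.
def feature_vector(counters, little_util, big_util):
    instr = counters["instructions"]
    x = [
        1.0,                                    # bias term
        counters["cpu_cycles"] / instr,
        counters["branch_mispredictions"] / instr,
        counters["llc_misses"] / instr,
        counters["data_memory_accesses"] / instr,
        counters["noncache_ext_mem_requests"] / instr,
        sum(little_util),                       # total little-cluster utilization
    ]
    x += sorted(big_util, reverse=True)         # sorted big-core utilizations (order assumed)
    return x

x_k = feature_vector(
    {"instructions": 10_000_000, "cpu_cycles": 15_000_000,
     "branch_mispredictions": 50_000, "llc_misses": 9_000,
     "data_memory_accesses": 3_000_000, "noncache_ext_mem_requests": 12_000},
    little_util=[0.2, 0.1, 0.0, 0.0], big_util=[0.4, 0.9, 0.0, 0.0])
print(len(x_k))  # 11 entries: 7 scalar features + 4 sorted big-core utilizations
```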
14
Offline Learning of the Classifier
§ We use multinomial logistic regression
– To classify different phases and output the optimal configurations
§ The likelihood function is maximized at design time
– To obtain the model coefficient matrix β
Likelihood: l(β) = ∏_{k=1}^{M} Pr(C = c_k* | X = x_k)
Conditional probability: Pr(C = c_k* | X = x_k) = e^{β·x_k} / (1 + e^{β·x_k})
M = #workload snippets × #configs
[Figure: characterization data (H/W counters, CPU stats) → solve for J(p_k) → feature data (x_k) and optimal config (c_k*) → training: solve l(β) offline to get the model coefficients β]
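The training step can be sketched end to end. This is not the authors' implementation: the sketch uses the softmax form of multinomial logistic regression (the multiclass generalization of the logistic expression on the slide), fit by gradient ascent on the log-likelihood, and the toy features (bias, cycles/instr, LLC-misses/instr) and two configuration classes are made up.

```python
# Multinomial logistic regression fit by gradient ascent on the
# log-likelihood l(beta), on a tiny fabricated dataset.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(X, y, n_classes, lr=0.5, epochs=200):
    """Return the coefficient matrix beta (one row per configuration class)."""
    d = len(X[0])
    beta = [[0.0] * d for _ in range(n_classes)]
    for _ in range(epochs):
        for x, c in zip(X, y):
            probs = softmax([sum(b * f for b, f in zip(row, x)) for row in beta])
            for i in range(n_classes):
                grad = (1.0 if i == c else 0.0) - probs[i]  # d log-likelihood / d beta_i
                for j in range(d):
                    beta[i][j] += lr * grad * x[j]
    return beta

def predict(beta, x):
    return max(range(len(beta)), key=lambda i: sum(b * f for b, f in zip(beta[i], x)))

# Compute-bound snippets (few LLC misses) -> class 0; memory-bound -> class 1
X = [[1.0, 1.2, 0.001], [1.0, 1.1, 0.002], [1.0, 2.5, 0.09], [1.0, 2.8, 0.08]]
y = [0, 0, 1, 1]
beta = train(X, y, n_classes=2)
print([predict(beta, x) for x in X])  # recovers the training labels: [0, 0, 1, 1]
```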
15
Design of Classifier
§ Two different feature vectors x_k can represent the same phase
– Hence they should map to the same optimal configuration c_k*
§ An arbitrary assignment of c_k* to x_k leads to a poor mapping function f
§ Therefore, we perform k-means clustering to find the natural clusters in the dataset
– This provides distinct phases out of all the workload snippets
– Then, we map the most common optimal configuration to each of the phases
[Figure: the training flow of the previous slide, with a k-means clustering stage inserted between the feature data (x_k) / optimal configs (c_k*) and the training step that solves l(β) offline for the model coefficients β]
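The clustering step above can be sketched as follows, under assumed details: a one-dimensional k-means on a single feature (LLC-misses per instruction) groups snippets into phases, and each cluster is then labeled with the most common optimal configuration among its members. The feature values and configuration names are fabricated.

```python
# 1-D k-means over snippet features, then majority-vote the optimal
# configuration within each cluster (phase).
from collections import Counter

def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalars; returns (centroids, cluster index per value)."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assign

# LLC-misses/instruction per snippet, with each snippet's offline optimal config
features = [0.001, 0.002, 0.003, 0.090, 0.085, 0.080]
optimal = ["4B-2GHz", "4B-2GHz", "2L-1GHz", "2L-1GHz", "2L-1GHz", "2L-1GHz"]

_, assign = kmeans_1d(features, k=2)
phase_config = {c: Counter(o for o, a in zip(optimal, assign) if a == c).most_common(1)[0][0]
                for c in set(assign)}
print(phase_config)  # each phase mapped to its most common optimal configuration
```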
16
Summary of Classifier Design
§ Offline: characterization data (H/W counters, CPU stats) → solve for J(p_k) → k-means clustering of the feature data (x_k) and optimal configs (c_k*) → training: solve l(β) offline to get the model coefficients β → store β
§ At runtime: read the new feature data (x_k) and the stored β, then find Pr(C = c_k* | X = x_k) to solve for c_k*
Objective: J(p_k) = min over c_k of [ E(c_k, p_k) + μ·t_exe(c_k, p_k) ]
Likelihood: l(β) = ∏_{k=1}^{M} Pr(C = c_k* | X = x_k)
Conditional probability: Pr(C = c_k* | X = x_k) = e^{β·x_k} / (1 + e^{β·x_k})
17
Step 4: Online Selection
§ Use the model coefficient matrix β stored on the platform (~282 bytes)
§ The conditional probability is computed at runtime for each of the N classes:
Pr(C = c_i* | X = x_k) = e^{β_i·x_k} / (1 + e^{β_i·x_k}) for each class i ∈ [1, N]
§ The class with the highest probability, argmax_i Pr(C = c_i* | X = x_k), is assigned to the system
§ Overhead = 20 μs
[Figure: read feature data (x_k) → compute the conditional probability for each class → find the most likely class i]
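The runtime step can be sketched in a few lines: score each of the N configuration classes with the stored coefficients and pick the most likely one. The coefficient matrix and feature vector below are made-up numbers, and the softmax scoring is the multiclass form of the slide's logistic expression.

```python
# Score each configuration class with the stored beta and select the
# most probable one for the new feature vector x_k (fabricated values).
import math

def select_config(beta, x):
    """Return (class index, probability) of the most likely configuration."""
    scores = [sum(b * f for b, f in zip(row, x)) for row in beta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

beta = [[0.4, -2.0, -8.0],   # class 0: compute-bound -> performance config
        [-0.4, 2.0, 8.0]]    # class 1: memory-bound  -> energy config
x_k = [1.0, 1.4, 0.05]       # bias, cycles/instr, LLC-misses/instr

cls, p = select_config(beta, x_k)
print(cls, round(p, 3))
```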
18
Outline
§ Motivation
§ Related Work
§ Dynamic Pareto-Optimal (DyPO) Configuration Selection
– Overview and Illustrative Example
– Details of the DyPO Framework
§ Experimental Results
§ Conclusion
19
Experimental Setup
§ Odroid-XU3 board
– Exynos 5422 octa-core CPU
§ Validated using 18 apps running on Ubuntu with Linux kernel v3.10
– MiBench suite
– Cortex suite
– PARSEC suite
[Figure: system stack: in user space, the benchmark with PAPI calls and DyPO; in kernel space, the sysfs interface, perf driver, and CPU governor exposing power, frequency, core configuration, and utilization from the PMU; output data is logged for analysis]
20
Classification Accuracy
§ We randomly partition the data into a 60% training-validation set and a 40% test set, leading to an average accuracy of
– Level-1 classifier: 99.9%
– Level-2 classifier: 92.7%
§ 5-fold cross-validation shows an average accuracy of
– Level-1 classifier: 99.9%
– Level-2 classifier: 80.5%
[Figure: per-benchmark accuracy (%) of the Level-1 and Level-2 classifiers for Basicmath, FFT, Qsort, SHA, Blowfish, Kmeans, PCA, Blackscholes-2T, Blackscholes-4T, Fluidanimate-2T, Fluidanimate-4T, and the average, plus a diagram of the two-level classifier hierarchy over the optimal configuration classes c_1*, c_2*, …]
21
Running Single-threaded Apps
§ DyPO-Energy achieves the minimum-energy goal
§ The interactive and powersave governors are away from the Pareto frontier
§ Ondemand is near the highest point of the Pareto frontier
[Figure annotation: 8.7% energy savings, 30% better performance]
22
Running Single-threaded Apps
We observe a similar trend across all 14 single-threaded benchmarks
23
Running Multi-threaded Apps
§ DyPO-Energy achieves the minimum-energy goal by staying near the lowest point of the Pareto frontier
§ For Fluidanimate-2T, DyPO-Energy provides a better result than the Pareto frontier
24
Running Multiple Applications
Minimum energy: 12 J ≈ 8.7 J + 3 J
The DyPO framework can perform optimization for multiple applications, using the hardware counters
25
Comparing Energy Consumption
§ 49%, 45%, and 6% energy savings compared to the interactive, ondemand, and powersave governors, respectively
26
Comparing Performance
§ While achieving the minimum energy, 24% better performance compared to the powersave governor
27
Comparison with Aalsaud et al. 2016
§ [Aalsaud et al. 2016] first derives power and performance models using linear regression for each frequency and application
§ It then uses these models to determine the optimal performance-per-watt (PPW) configuration at runtime
[Figure: normalized PPW of DyPO-Energy, Aalsaud-Offline, Aalsaud-ADA, and ondemand across Blowfish, Spectral, AES, Kmeans, Fluidanimate-2T, ADPCM, PCA, MotionEstimation, Basicmath, Blackscholes-4T, Blackscholes-2T, Patricia, Qsort, Fluidanimate-4T, Dijkstra, FFT, SHA, StringSearch, and the average]
DyPO-Energy shows 25% and 55% gains in PPW compared to Aalsaud-Offline and Aalsaud-ADA
28
Conclusions
§ Our technique successfully finds the Pareto-optimal configurations at runtime as a function of the workload
§ Experiments show an average increase of
– 93%, 81%, and 6% in performance per watt (PPW) compared to the interactive, ondemand, and powersave governors
Thank you!