autotuning programs with algorithmic choice - mit...

89
Introduction PetaBricks OpenTuner Conclusions Autotuning Programs with Algorithmic Choice Jason Ansel MIT - CSAIL December 18, 2013

Upload: lycong

Post on 11-Nov-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Introduction PetaBricks OpenTuner Conclusions

Autotuning Programs with Algorithmic Choice

Jason Ansel

MIT - CSAIL

December 18, 2013

Introduction PetaBricks OpenTuner Conclusions

High Performance Search Problem

Performance search space:

Para

llelis

m c

ho

ices

Accuracy choices

A

lgorithmic choices

• Parallelism

• Exploiting parallelism isnecessary but not sufficient

• Performance is amulti-dimensional search problem

• Normally done by expertprogrammers

• Optimization decisions oftenchange program results

Introduction PetaBricks OpenTuner Conclusions

High Performance Search Problem

Performance search space:

Para

llelis

m c

ho

ices

Accuracy choices

A

lgorithmic choices

• Parallelism Performance• Exploiting parallelism is

necessary but not sufficient

• Performance is amulti-dimensional search problem

• Normally done by expertprogrammers

• Optimization decisions oftenchange program results

Introduction PetaBricks OpenTuner Conclusions

High Performance Search Problem

Performance search space:

Para

llelis

m c

ho

ices

Accuracy choices

A

lgorithmic choices

• Parallelism Performance• Exploiting parallelism is

necessary but not sufficient

• Performance is amulti-dimensional search problem

• Normally done by expertprogrammers

• Optimization decisions oftenchange program results

Introduction PetaBricks OpenTuner Conclusions

High Performance Search Problem

Goal of this workTo automate the process of program optimization to createprograms that can adapt to changing environments and goals.

• Language level solutions for concisely representing algorithmicchoice spaces.

• Processes and compilation techniques to manage and explorethese spaces.

• Autotuning techniques to efficiently solve these searchproblems.

Introduction PetaBricks OpenTuner Conclusions

High Performance Search Problem

Goal of this workTo automate the process of program optimization to createprograms that can adapt to changing environments and goals.

• Language level solutions for concisely representing algorithmicchoice spaces.

• Processes and compilation techniques to manage and explorethese spaces.

• Autotuning techniques to efficiently solve these searchproblems.

Introduction PetaBricks OpenTuner Conclusions

Research Covered in This Talk

• The PetaBricks programming language: algorithmic choice atthe language level [PLDI’09]

• Language level support for variable accuracy [CGO’11]

• Automated construction of multigrid V-cycles [SC’09]

• Code generation and autotuning for heterogeneous CPU/GPUmix of parallel processing units [ASPLOS’13]

• Solution for input sensitivity based on adaptiveoverhead-aware classifiers [Under review]

• OpenTuner: an extensible framework for program autotuning[Under review]

• Won’t be talking about work in: ASPLOS’09, ASPLOS’12,GECCO’11, IPDPS’09, PLDI’11, and many others

Introduction PetaBricks OpenTuner Conclusions

Research Covered in This Talk

• The PetaBricks programming language: algorithmic choice atthe language level [PLDI’09]

• Language level support for variable accuracy [CGO’11]

• Automated construction of multigrid V-cycles [SC’09]

• Code generation and autotuning for heterogeneous CPU/GPUmix of parallel processing units [ASPLOS’13]

• Solution for input sensitivity based on adaptiveoverhead-aware classifiers [Under review]

• OpenTuner: an extensible framework for program autotuning[Under review]

• Won’t be talking about work in: ASPLOS’09, ASPLOS’12,GECCO’11, IPDPS’09, PLDI’11, and many others

Introduction PetaBricks OpenTuner Conclusions

A Motivating Example for Algorithmic Choice

• How would you write a fast sorting algorithm?

• Insertion sort• Quick sort• Merge sort• Radix sort

• Poly-algorithms

Introduction PetaBricks OpenTuner Conclusions

A Motivating Example for Algorithmic Choice

• How would you write a fast sorting algorithm?• Insertion sort• Quick sort• Merge sort• Radix sort

• Poly-algorithms

Introduction PetaBricks OpenTuner Conclusions

A Motivating Example for Algorithmic Choice

• How would you write a fast sorting algorithm?• Insertion sort• Quick sort• Merge sort• Radix sort

• Poly-algorithms

Introduction PetaBricks OpenTuner Conclusions

std::stable sort

/usr/include/c++/4.5.2/bits/stl algo.h lines 3350-3367

Introduction PetaBricks OpenTuner Conclusions

std::stable sort

/usr/include/c++/4.5.2/bits/stl algo.h lines 3350-3367

Introduction PetaBricks OpenTuner Conclusions

Why 15?

• Why 15?

• Dates back to at least 2000 (June 2000 SGI release)

• Still in current C++ STL shipped with GCC

• cutoff = 15 survived 10+ years

• In the source code for millions of C++ programs

• There is nothing the compiler can do about it

Introduction PetaBricks OpenTuner Conclusions

Why 15?

• Why 15?

• Dates back to at least 2000 (June 2000 SGI release)

• Still in current C++ STL shipped with GCC

• cutoff = 15 survived 10+ years

• In the source code for millions1 of C++ programs

• There is nothing the compiler can do about it

1Any C++ program with “#include <algorithm>”, conservative estimate based on:

http://c2.com/cgi/wiki?ProgrammingLanguageUsageStatistics

Introduction PetaBricks OpenTuner Conclusions

Is 15 The Right Number?

• The best cutoff (CO) changes

• Depends on competing costs:• Cost of computation (< operator, call overhead, etc)• Cost of communication (swaps)• Cache behavior (misses, prefetcher, locality)

• Sorting 100000 doubles with std::stable sort:• CO ≈ 200 optimal on a Phenom 905e (15% speedup)• CO ≈ 400 optimal on a Opteron 6168 (15% speedup)• CO ≈ 500 optimal on a Xeon E5320 (34% speedup)• CO ≈ 700 optimal on a Xeon X5460 (25% speedup)

• If the best cutoff has changed, perhaps best algorithm hasalso changed

Introduction PetaBricks OpenTuner Conclusions

Is 15 The Right Number?

• The best cutoff (CO) changes

• Depends on competing costs:• Cost of computation (< operator, call overhead, etc)• Cost of communication (swaps)• Cache behavior (misses, prefetcher, locality)

• Sorting 100000 doubles with std::stable sort:• CO ≈ 200 optimal on a Phenom 905e (15% speedup)• CO ≈ 400 optimal on a Opteron 6168 (15% speedup)• CO ≈ 500 optimal on a Xeon E5320 (34% speedup)• CO ≈ 700 optimal on a Xeon X5460 (25% speedup)

• If the best cutoff has changed, perhaps best algorithm hasalso changed

Introduction PetaBricks OpenTuner Conclusions

Algorithmic Choice

• Compiler’s hands are tied, it is stuck with 15

• Need a better way to represent algorithmic choices

• PetaBricks is the first language with support for algorithmicchoice

Introduction PetaBricks OpenTuner Conclusions

Sort in PetaBricks

Language

funct ion S o r tto out [ n ]from i n [ n ]{

e i t h e r {I n s e r t i o n S o r t ( out , i n ) ;

} or {Q u i c k S o r t ( out , i n ) ;

} or {MergeSort ( out , i n ) ;

} or {R a d i x S o r t ( out , i n ) ;

}}

⇒Representation

Decision treesynthesized by ourautotuner

Introduction PetaBricks OpenTuner Conclusions

Sort in PetaBricks

Language

funct ion S o r tto out [ n ]from i n [ n ]{

e i t h e r {I n s e r t i o n S o r t ( out , i n ) ;

} or {Q u i c k S o r t ( out , i n ) ;

} or {MergeSort ( out , i n ) ;

} or {R a d i x S o r t ( out , i n ) ;

}}

⇒Representation

Decision treesynthesized by ourautotuner

Introduction PetaBricks OpenTuner Conclusions

Decision Trees

Optimized for a Xeon E7340 (8 cores):

N < 600

N < 1420Insertion Sort

Quick Sort Merge Sort(2-way)

Introduction PetaBricks OpenTuner Conclusions

Decision Trees

Optimized for Sun Fire T200 Niagara (8 cores):

N < 1461

N < 2400

Merge Sort(4-way)

Merge Sort(2-way)

N < 75

Merge Sort(8-way)

Merge Sort(16-way)

Introduction PetaBricks OpenTuner Conclusions

Sort Algorithm Timings2

0

0.0005

0.001

0.0015

0.002

0.0025

0 250 500 750 1000 1250 1500 1750

Tim

e (s

)

Input Size

InsertionSortQuickSortMergeSortRadixSortAutotuned

2On an 8-way Xeon E7340 system

Introduction PetaBricks OpenTuner Conclusions

Iteration Order Choices

• Many other choices related to execution order• By rows?• By columns?• Diagonal? Reverse order? Blocked?• Parallel?

• Choices both within a single (possibly parallel)task and between different tasks

• This is main motivation for a new language asopposed to a library

Introduction PetaBricks OpenTuner Conclusions

Iteration Order Choices

• Many other choices related to execution order• By rows?• By columns?• Diagonal? Reverse order? Blocked?• Parallel?

• Choices both within a single (possibly parallel)task and between different tasks

• This is main motivation for a new language asopposed to a library

Introduction PetaBricks OpenTuner Conclusions

Synthesized Outer Control Flow

• PetaBricks programs have synthesized outer control flow• Declarative (data flow like) outer syntax• Imperative inner code

• Programs start as completely parallel

• Added dependencies restrict the space of legal executions

• May only access data explicitly depended on

Parallel loop

X . c e l l ( i ) from ( ) { . . . }

Sequential loop

X . c e l l ( i ) from (X . c e l l ( i −1) l e f t ) { . . . }

Introduction PetaBricks OpenTuner Conclusions

Matrix Multiply

transform M a t r i x M u l t i p l yto AB[ w, h ]from A [ c , h ] , B [ w, c ]{

AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b ){return dot ( a , b ) ;

}

}

to (AB. reg ion ( x , y , x + 4 , y + 4) out )from (A . reg ion ( 0 , y , c , y + 4) a ,

B . reg ion ( x , 0 , x + 4 , c ) b ){// . . . compute 4 x 4 b l o c k . . .

}}

Introduction PetaBricks OpenTuner Conclusions

Matrix Multiply

transform M a t r i x M u l t i p l yto AB[ w, h ]from A [ c , h ] , B [ w, c ]{

AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b ){return dot ( a , b ) ;

}

to (AB. reg ion ( x , y , x + 4 , y + 4) out )from (A . reg ion ( 0 , y , c , y + 4) a ,

B . reg ion ( x , 0 , x + 4 , c ) b ){// . . . compute 4 x 4 b l o c k . . .

}}

Introduction PetaBricks OpenTuner Conclusions

Strassen Matrix Multiply

t r a n s f o r m S t r a s s e nto AB[ n , n ]from A[ n , n ] , B [ n , n ]u s i n g M1[ n/2 , n /2 ] , M2[ n /2 , n /2 ] , M3[ n /2 , n /2 ] , M4[ n /2 , n /2 ] ,

M5[ n /2 , n /2 ] , M6[ n /2 , n /2 ] , M7[ n /2 , n /2 ]{

to (M1 m1)from (A . r e g i o n (0 , 0 , n /2 , n /2) a11 ,

A . r e g i o n ( n /2 , n /2 , n , n ) a22 ,B . r e g i o n (0 , 0 , n /2 , n /2) b11 ,B . r e g i o n ( n /2 , n /2 , n , n ) b22 )

u s i n g ( t1 [ n / 2 , n / 2 ] , t2 [ n /2 , n / 2 ] ) {spawn MatrixAdd ( t1 , a11 , a22 ) ;spawn MatrixAdd ( t2 , b11 , b22 ) ;sync ;S t r a s s e n (m1, t1 , t2 ) ;

}. . . .// Compute one quadrant o f output w i th s t r a s s e n decompos i t i onto (AB. r e g i o n ( n /2 , 0 , n , n /2) c12 ) from (M3 m3, M5 m5){

MatrixAdd ( c12 , m3, m5 ) ;}. . . .// Or , compute e l ement i n output d i r e c t l y ( same as l a s t s l i d e )AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b){

r e t u r n dot ( a , b ) ;}

}

Introduction PetaBricks OpenTuner Conclusions

Variable Accuracy Algorithms

• Many problems don’t have a single correct answer,optimizations often trade-off accuracy and performance.

• Soft computing• DSP algorithms• Iterative algorithms

• Variable accuracy, supported in the PetaBricks language, is afundamental part of algorithmic choice which enables newclasses of programs to be represented.

Introduction PetaBricks OpenTuner Conclusions

Variable Accuracy Algorithms

• Many problems don’t have a single correct answer,optimizations often trade-off accuracy and performance.

• Soft computing• DSP algorithms• Iterative algorithms

• Variable accuracy, supported in the PetaBricks language, is afundamental part of algorithmic choice which enables newclasses of programs to be represented.

Introduction PetaBricks OpenTuner Conclusions

K-Means Examplet r a n s f o r m kmeansfrom Po in t s [ n , 2 ] // Array o f p o i n t s ( each column

// s t o r e s x and y c o o r d i n a t e s )u s i n g Cen t r o i d s [ s q r t ( n ) , 2 ]to Ass ignments [ n ]{

// Rule 1 :// One p o s s i b l e i n i t i a l c o n d i t i o n : Random// s e t o f p o i n t sto ( C en t r o i d s . column ( i ) c ) from ( Po i n t s p ) {

c=p . column ( rand (0 , n ) )}

// Rule 2 :// Another i n i t i a l c o n d i t i o n : C en t e r p l u s i n i t i a l// c e n t e r s ( kmeans++)to ( C en t r o i d s c ) from ( Po i n t s p ) {

Cen t e rP l u s ( c , p ) ;}

// Rule 3 :// The kmeans i t e r a t i v e a l g o r i t hmto ( Ass ignments a ) from ( Po i n t s p , C en t r o i d s c ) {

w h i l e ( t r u e ) {i n t change ;A s s i g nC l u s t e r s ( a , change , p , c , a ) ;i f ( change==0) r e t u r n ; // Reached f i x e d po i n tNewC lu s t e rLoca t i on s ( c , p , a ) ;

}}

}

Introduction PetaBricks OpenTuner Conclusions

K-Means Example (Variable Accuracy)t r a n s f o r m kmeansa c c u r a c y m e t r i c kmeansaccuracya c c u r a c y v a r i a b l e kfrom Po in t s [ n , 2 ] // Array o f p o i n t s ( each column

// s t o r e s x and y c o o r d i n a t e s )u s i n g Cen t r o i d s [ k , 2 ]to Ass ignments [ n ]

...

// Rule 3 :// The kmeans i t e r a t i v e a l g o r i t hmto ( Ass ignments a ) from ( Po i n t s p , C en t r o i d s c ) {

f o r e n o u g h {i n t change ;A s s i g nC l u s t e r s ( a , change , p , c , a ) ;i f ( change==0) r e t u r n ; // Reached f i x e d po i n tNewC lu s t e rLoca t i on s ( c , p , a ) ;

}}

}t r a n s f o r m kmeansaccuracyfrom Ass ignments [ n ] , Po i n t s [ n , 2 ]to Accuracy{

Accuracy from ( Ass ignments a , Po i n t s p){r e t u r n s q r t (2∗n/ SumClus te rD i s tanceSquared ( a , p ) ) ;

}}

Introduction PetaBricks OpenTuner Conclusions

Semantics of Variable Accuracy

Running the accuracy metric on the output will return a valuethat, in expectation, exceeds the accuracy target more than Ppercent of the time.

• Expected distribution of accuracy measured during autotuningtime, not at runtime.

• When fixed accuracy code calls variable accuracy code, anaccuracy target must be specified.

• When variable accuracy code call code containing variableaccuracy components, only the outer most accuracy targetwill be honored.

Introduction PetaBricks OpenTuner Conclusions

Semantics of Variable Accuracy

Running the accuracy metric on the output will return a valuethat, in expectation, exceeds the accuracy target more than Ppercent of the time.

• Expected distribution of accuracy measured during autotuningtime, not at runtime.

• When fixed accuracy code calls variable accuracy code, anaccuracy target must be specified.

• When variable accuracy code call code containing variableaccuracy components, only the outer most accuracy targetwill be honored.

Introduction PetaBricks OpenTuner Conclusions

A Brief Multigrid Intro

• Used to iteratively solve PDEs over a gridded domain

• Relaxations update points using neighboring values (stencilcomputations)

• Restrictions and Interpolations compute new grid with coarseror finer discretization

Introduction PetaBricks OpenTuner Conclusions

Standard Cycle Shaps

• Cycle shapes effect accuracy andperformance

• Equation, accuracy target, data,and execution platform effectefficacy of different shapes

• Entire papers published about newcycle shapes!

• We fundamentally change thestatus quo in this domain

• Define the search space of cycleshapes once

• Autotune to find a cycle shapetailored to your problem

Introduction PetaBricks OpenTuner Conclusions

Standard Cycle Shaps

• Cycle shapes effect accuracy andperformance

• Equation, accuracy target, data,and execution platform effectefficacy of different shapes

• Entire papers published about newcycle shapes!

• We fundamentally change thestatus quo in this domain

• Define the search space of cycleshapes once

• Autotune to find a cycle shapetailored to your problem

Introduction PetaBricks OpenTuner Conclusions

Choice Space of Multigrid

Introduction PetaBricks OpenTuner Conclusions

Autotuned V-cycle Shapes

101

Gri

d S

ize

2048

1024

512

256

128

64

32

16

103

105

107

Gri

d S

ize

2048

1024

512

256

128

64

32

16

Introduction PetaBricks OpenTuner Conclusions

Dynamic Programming Technique for Autotuning Multigrid

Grid size i

⇒ Grid size 2i

• Partition accuracy space into discrete levels

• Base space of candidate algorithms on optimal algorithmsfrom coarser level

Introduction PetaBricks OpenTuner Conclusions

Dynamic Programming Technique for Autotuning Multigrid

Grid size i

⇒ Grid size 2i

• Partition accuracy space into discrete levels

• Base space of candidate algorithms on optimal algorithmsfrom coarser level

Introduction PetaBricks OpenTuner Conclusions

Dynamic Programming Technique for Autotuning Multigrid

Grid size i

⇒ Grid size 2i

• Partition accuracy space into discrete levels

• Base space of candidate algorithms on optimal algorithmsfrom coarser level

Introduction PetaBricks OpenTuner Conclusions

Dynamic Programming Technique for Autotuning Multigrid

Grid size i

⇒ Grid size 2i

• Partition accuracy space into discrete levels

• Base space of candidate algorithms on optimal algorithmsfrom coarser level

Introduction PetaBricks OpenTuner Conclusions

Dynamic Programming Technique for Autotuning Multigrid

Grid size i

⇒Grid size 2i

• Partition accuracy space into discrete levels

• Base space of candidate algorithms on optimal algorithmsfrom coarser level

Introduction PetaBricks OpenTuner Conclusions

2D Poisson’s Equation (uses Multigrid)

1

2

4

8

16

32

64

100 1000 10000 100000 1e+06

Spe

edup

(x)

Input Size

Accuracy Level 109

Accuracy Level 107

Accuracy Level 105

Accuracy Level 103

Accuracy Level 101

2D Poisson’s equation

Introduction PetaBricks OpenTuner Conclusions

More Variable Accuracy Results

1

2

4

8

10 100 1000

Spe

edup

(x)

Input Size

Accuracy Level 0.95Accuracy Level 0.75Accuracy Level 0.50Accuracy Level 0.20Accuracy Level 0.10Accuracy Level 0.05

Clustering

1

10

100

1000

10000

10 100 1000 10000 100000 1e+06S

peed

up (

x)

Input Size

Accuracy Level 1.01Accuracy Level 1.1 Accuracy Level 1.2 Accuracy Level 1.3 Accuracy Level 1.4

Bin Packing

1

2

4

8

16

32

10 100 1000 10000

Spe

edup

(x)

Input Size

Accuracy Level 2.0Accuracy Level 1.5Accuracy Level 1.0Accuracy Level 0.8Accuracy Level 0.6Accuracy Level 0.3

Image Compression

1

2

4

8

16

32

10 100 1000 10000 100000

Spe

edup

(x)

Input Size

Accuracy Level 109

Accuracy Level 107

Accuracy Level 105

Accuracy Level 103

Accuracy Level 101

3D Helmholtz

1

2

4

8

16

32

64

100 1000 10000 100000 1e+06

Spe

edup

(x)

Input Size

Accuracy Level 109

Accuracy Level 107

Accuracy Level 105

Accuracy Level 103

Accuracy Level 101

2D Poisson

1

2

4

8

10 100 1000 10000

Spe

edup

(x)

Input Size

Accuracy Level 3.0Accuracy Level 2.0Accuracy Level 1.5Accuracy Level 1.0Accuracy Level 0.5Accuracy Level 0.0

Preconditioner

Introduction PetaBricks OpenTuner Conclusions

Results on Different Systems

Test SystemsCodename CPU(s) Cores GPU OpenCL Runtime

Desktop Core i7 920 @2.67GHz 4 NVIDIA Tesla C2070 CUDA Toolkit 3.2

Server 4× Xeon X7550 @2GHz 32 None AMD APP SDK 2.5

Laptop Core i5 2520M @2.5GHz 2 AMD Radeon HD 6630M Xcode 4.2

BenchmarksName

# PossibleConfigs

Generated OpenCLKernels

MeanAutotuning Time

TestingInput Size

SeparableConv. 101358 9 3.82 hours 35202

Black-Sholes 10130 1 3.09 hours 500000

Poisson2D SOR 101358 25 15.37 hours 20482

Sort 10920 7 3.56 hours 220

Strassen 101509 9 3.05 hours 10242

SVD 102435 8 1.79 hours 2562

Tridiagonal Solver 101040 8 5.56 hours 10242

Introduction PetaBricks OpenTuner Conclusions

Results on Different Systems

Test SystemsCodename CPU(s) Cores GPU OpenCL Runtime

Desktop Core i7 920 @2.67GHz 4 NVIDIA Tesla C2070 CUDA Toolkit 3.2

Server 4× Xeon X7550 @2GHz 32 None AMD APP SDK 2.5

Laptop Core i5 2520M @2.5GHz 2 AMD Radeon HD 6630M Xcode 4.2

BenchmarksName

# PossibleConfigs

Generated OpenCLKernels

MeanAutotuning Time

TestingInput Size

SeparableConv. 101358 9 3.82 hours 35202

Black-Sholes 10130 1 3.09 hours 500000

Poisson2D SOR 101358 25 15.37 hours 20482

Sort 10920 7 3.56 hours 220

Strassen 101509 9 3.05 hours 10242

SVD 102435 8 1.79 hours 2562

Tridiagonal Solver 101040 8 5.56 hours 10242

Introduction PetaBricks OpenTuner Conclusions

Separable Convolution (width=7)

1.0x

2.0x

3.0x

Desktop

Execution T

ime (

Norm

aliz

ed) Desktop Config

Server ConfigLaptop Config

Desktop Config Server Config Laptop Config

SeparableConv.1D kernel+local memory onGPU

1D kernel on OpenCL2D kernel+local memory onGPU

Introduction PetaBricks OpenTuner Conclusions

Separable Convolution (width=7)

1.0x

2.0x

3.0x

Desktop Server Laptop

Execution T

ime (

Norm

aliz

ed) Desktop Config

Server ConfigLaptop Config

Desktop Config Server Config Laptop Config

SeparableConv.1D kernel+local memory onGPU

1D kernel on OpenCL2D kernel+local memory onGPU

Introduction PetaBricks OpenTuner Conclusions

Separable Convolution (width=7)

1.0x

2.0x

3.0x

Desktop Server Laptop

Execution T

ime (

Norm

aliz

ed) Desktop Config

Server ConfigLaptop Config

Hand-coded OpenCL

Desktop Config Server Config Laptop Config

SeparableConv.1D kernel+local memory onGPU

1D kernel on OpenCL2D kernel+local memory onGPU

Introduction PetaBricks OpenTuner Conclusions

Poisson 2D SOR

1.0x

3.0x

5.0x

7.0x

9.0x

Desktop Server Laptop

Execution T

ime (

Norm

aliz

ed) Desktop Config

Server ConfigLaptop Config

Desktop Config Server Config Laptop Config

Poisson2D SORSplit on CPU followed bycompute on GPU

Split some parts on OpenCLfollowed by compute on CPU

Split on CPU followed bycompute on GPU

Introduction PetaBricks OpenTuner Conclusions

Singular Value Decomposition (SVD)

1.0x

1.2x

1.4x

1.6x

1.8x

2.0x

Desktop Server Laptop

Execution T

ime (

Norm

aliz

ed) Desktop Config

Server ConfigLaptop Config

Desktop Config Server Config Laptop Config

SVD

First phase: task parallism be-tween CPU/GPU; matrix multi-ply: 8-way parallel recursive de-composition on CPU, call LA-PACK when < 42× 42

First phase: all on CPU; ma-trix multiply: 8-way parallel re-cursive decomposition on CPU,call LAPACK when < 170×170

First phase: all on CPU; ma-trix multiply: 4-way parallel re-cursive decomposition on CPU,call LAPACK when < 85× 85

Introduction PetaBricks OpenTuner Conclusions

Results Takeaways

• Different configurations are required for best performance ondifferent systems

• Not just changing block sizes

• Can not be easily solved by a simple heuristic

• Motivates the need for algorithmic choice and autotuning

Introduction PetaBricks OpenTuner Conclusions

Autotuning Challenges

• Evaluating quality of candidate algorithms is expensive• Must run the program (at least once)• More expensive for unfit solutions• Scales poorly with larger problem sizes

• Fitness is noisy• Randomness from parallel races and system noise• Testing each candidate only once often produces a worse

algorithm• Running many trials is expensive

• Decision tree structures are complex• Not easy to hill-climb• We artificially bound them

Introduction PetaBricks OpenTuner Conclusions

Input Sensitivity

• Input sensitivity is a major challenge

• Different algorithms may be better for different inputs

• Use fast algorithm for easy inputs, slow algorithm for hardinputs

• Avoid pathological cases

Introduction PetaBricks OpenTuner Conclusions

Input Sensitivity Today

• Vast majority of programs today use a single algorithm for allinputs

• This forces design for the “worst case” input• Wastes time and resources

• Related work:• Uses hand written heuristics to adapt to inputs• Rectify inputs for security [Long el al.]

• Our system automatically classifies inputs and runs a programoptimized for the type of input being processed

Introduction PetaBricks OpenTuner Conclusions

Input Sensitivity Today

• Vast majority of programs today use a single algorithm for allinputs

• This forces design for the “worst case” input• Wastes time and resources

• Related work:• Uses hand written heuristics to adapt to inputs• Rectify inputs for security [Long el al.]

• Our system automatically classifies inputs and runs a programoptimized for the type of input being processed

Introduction PetaBricks OpenTuner Conclusions

Input Sensitivity Overview

TrainingDeployment

Input Classifier

Input Aware Learning

Program

Training Inputs

Feature ExtractorsInsights: - Feature Priority List - Performance Bounds

Input

Select Input Optimized Programs

Training

Selected Program

Run

Introduction PetaBricks OpenTuner Conclusions

Input Featuresf u n c t i o n Sor tto out [ n ]from i n [ n ]

i n p u t f e a t u r e So r t edne s s , Dup l i c a t i o n

{ . . . }f u n c t i o n So r t e dn e s sfrom i n [ n ]to s o r t e d n e s st u n a b l e doub l e l e v e l ( 0 . 0 , 1 . 0 ){

i n t s o r t e d coun t = 0 ;i n t count = 0 ;i n t s t e p = ( i n t ) ( l e v e l ∗n ) ;f o r ( i n t i =0; i+step<n ; i+=s t ep ) {

i f ( i n [ i ] <= in [ i+s t ep ] ) {// inc r ement f o r c o r r e c t l y o r d e r ed// p a i r s o f e l ement ss o r t e d coun t += 1 ;

}count += 1 ;

}i f ( count > 0)

s o r t e d n e s s = so r t e d coun t / ( doub l e ) count ;e l s e

s o r t e d n e s s = 0 . 0 ;}f u n c t i o n Dup l i c a t i o nfrom i n [ n ]to d u p l i c a t i o n{ . . . }

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Input Space Sampling

Duplication

Sortedness

Introduction PetaBricks OpenTuner Conclusions

Training

FeaturesInput

Labels

Decision Tree

Max A Priori

Adaptive Tree

Classifier Constructors

1...m

0

m+1

Classifier Selector

Selection Objective

Considers cost of

extracting needed features

Input Classifier

Introduction PetaBricks OpenTuner Conclusions

How Many Landmarks Are Enough?

0 0.2 0.4 0.6 0.8 1

Lost

spe

edup

(L)

Size of region (pi)

2 configs3 configs4 configs5 configs6 configs7 configs8 configs9 configs

10 20 30 40 50 60 70 80 90 100S

peed

upLandmarks

Introduction PetaBricks OpenTuner Conclusions

Input Adaptation Results

sort

1 100

1

3

5

Landmarks

Spe

edup

clustering

1 100

2

3

Landmarks

Spe

edup

binpacking

1 100

1

1.05

Landmarks

Spe

edup

svd

1 100

0.9

1.1

Landmarks

Spe

edup

poisson2d

1 100

0.9

1.2

Landmarks

Spe

edup

helmholtz3d

1 100

0.8

1.0

Landmarks

Spe

edup

Introduction PetaBricks OpenTuner Conclusions

Related Projects

A small selection of many related projects:Package Domain Search Method

Active Harmony Runtime System Nelder-Mead

ATLAS Dense Linear Algebra Exhaustive

Code Perforation Compiler Exhaustive + Simulated Annealing

Dynamic Knobs Runtime System Control Theory

FFTW Fast Fourier Transform Exhaustive / Dynamic Prog.

Insieme Compiler Differential Evolution

Milepost GCC / cTuning Compiler IID Model + Central DB

OSKI Sparse Linear Algebra Exhaustive + Heuristic

PATUS Stencil Computations Nelder-Mead or Evolutionary

SEEC / Heartbeats Runtime System Control Theory

Sepya Stencil Computations Random-Restart Gradient Ascent

SPIRAL DSP Algorithms Pareto Active Learning

• Simple techniques (exhaustive, hill climbers, etc) are popular• No single technique is best for all problems

• Representations are often just integers/floats/booleans

Introduction PetaBricks OpenTuner Conclusions

Related Projects

A small selection of many related projects:Package Domain Search Method

Active Harmony Runtime System Nelder-Mead

ATLAS Dense Linear Algebra Exhaustive

Code Perforation Compiler Exhaustive + Simulated Annealing

Dynamic Knobs Runtime System Control Theory

FFTW Fast Fourier Transform Exhaustive / Dynamic Prog.

Insieme Compiler Differential Evolution

Milepost GCC / cTuning Compiler IID Model + Central DB

OSKI Sparse Linear Algebra Exhaustive + Heuristic

PATUS Stencil Computations Nelder-Mead or Evolutionary

SEEC / Heartbeats Runtime System Control Theory

Sepya Stencil Computations Random-Restart Gradient Ascent

SPIRAL DSP Algorithms Pareto Active Learning

• Simple techniques (exhaustive, hill climbers, etc) are popular• No single technique is best for all problems

• Representations are often just integers/floats/booleans

Introduction PetaBricks OpenTuner Conclusions

Limits of Existing Autotuning Projects

• We believe these factors limit the scope and efficiency ofautotuning

• A hill climber works great for a block size, but completely failsat synthesizing poly-algorithms

• Many users of autotuning work hard to prune their searchspaces to fit their techniques

• OpenTuner provides extensible representations and ensemblesof techniques which can solve more complex autotuningproblems

Introduction PetaBricks OpenTuner Conclusions

Limits of Existing Autotuning Projects

• We believe these factors limit the scope and efficiency ofautotuning

• A hill climber works great for a block size, but completely failsat synthesizing poly-algorithms

• Many users of autotuning work hard to prune their searchspaces to fit their techniques

• OpenTuner provides extensible representations and ensemblesof techniques which can solve more complex autotuningproblems

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Overview

OpenTuner: an extensible framework for program autotuning

Results Database

Search TechniquesSearch

Driver

Search

Reads: ResultsWrites: Desired Results

Measurement

User Defined Measurement

Function

Measurement Driver

Configuration Manipulator

Reads: Desired ResultsWrites: Results

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Configuration Manipulator Parameters

Parameter

Primitive Complex

Integer ScaledNumericFloat

LogInteger LogFloat PowerOfTwo

Switch Enum Permutation

Schedule

SelectorBoolean

• Hierarchical structure of parameters, user defined parametertypes can be added at any point

• Primitive parameters behave like bounded integers or floats

• Complex parameters have a set of stochastic mutationoperators

• Technique-specific operators

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Information sharing through ResultsDB

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Information sharing through ResultsDB

AUC Bandit

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Information sharing through ResultsDB

AUC Bandit

Which configuration should we try next?

?

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Information sharing through ResultsDB

AUC Bandit

Which configuration should we try next?

33%

Exploration

33% 33%

Introduction PetaBricks OpenTuner Conclusions

Ensembles of Techniques

Differential Evolution

Particle Swarm

Optimization

TorczonHill

Climber

Information sharing through ResultsDB

AUC Bandit

Which configuration should we try next?

100%

Exploitation

0% 0%

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Results

Project Benchmark Possible Configurations

GCC/G++ Flags all 10806

Halide Blur 1052

Halide Wavelet 1044

HPL n/a 109.9

PetaBricks Poisson 103657

PetaBricks Sort 1090

PetaBricks Strassen 10188

PetaBricks TriSolve 101559

Stencil all 106.5

Unitary n/a 1021

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Results

Project Benchmark Possible Configurations

GCC/G++ Flags all 10806

Halide Blur 1052

Halide Wavelet 1044

HPL n/a 109.9

PetaBricks Poisson 103657

PetaBricks Sort 1090

PetaBricks Strassen 10188

PetaBricks TriSolve 101559

Stencil all 106.5

Unitary n/a 1021

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Results: GCC Flags

fft.c

0.8

0.85

0.9

0.95

1

0 300 600 900 1200 1500 1800

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

gcc -O1gcc -O2gcc -O3

OpenTuner

matrixmultiply.cpp

0.1

0.15

0.2

0.25

0.3

0 300 600 900 1200 1500 1800

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

g++ -O1g++ -O2g++ -O3

OpenTuner

raytracer.cpp

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

0 300 600 900 1200 1500 1800

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

g++ -O1g++ -O2g++ -O3

OpenTuner

tsp ga.cpp

0.4

0.45

0.5

0.55

0.6

0 300 600 900 1200 1500 1800

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

g++ -O1g++ -O2g++ -O3

OpenTuner

Introduction PetaBricks OpenTuner Conclusions

OpenTuner Results: PetaBricks

Poisson 2D

0 0.02 0.04 0.06 0.08

0.1 0.12 0.14 0.16 0.18

0.2

0 600 1200 1800 2400 3000 3600

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

PetaBricks AutotunerOpenTuner

Sort

0

0.02

0.04

0.06

0.08

0.1

0 600 1200

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

PetaBricks AutotunerOpenTuner

Strassen

0

0.05

0.1

0.15

0.2

0 600 1200

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

PetaBricks AutotunerOpenTuner

Tridiagonal Solver

0.0095

0.01

0.0105

0.011

0.0115

0 600 1200

Exe

cutio

n Ti

me

(sec

onds

)

Autotuning Time (seconds)

PetaBricks AutotunerOpenTuner

Introduction PetaBricks OpenTuner Conclusions

Conclusions

• PetaBricks has pushed the limits of what can be done withalgorithmic choice

• Provides performance portability by allowing programs toadapt to their environment

• Have shown: variable accuracy, multigrid, and input sensitivity• Hope that future main stream programming languages will

incorperate algorithmic choice and autotuning

• OpenTuner can expand the scope of program autotuning forother projects

• Extensible configuration representation• Ensembles of techniques• Hope that field of autotuning will expand to much more

complex problems

Introduction PetaBricks OpenTuner Conclusions

Coauthors and Collaborators

• Saman Amarasinghe

• Cy Chan

• Yufei Ding

• Alan Edelman

• Sam Fingeret

• Sanath Jayasena

• Shoaib Kamil

• Kevin Kelley

• Erika Lee

• Deepak Narayanan

• Marek Olszewski

• Una-May O’Reilly

• Maciej Pacula

• Phitchaya Mangpo Phothilimthana

• Jonathan Ragan-Kelley

• Xipeng Shen

• Michele Tartara

• Kalyan Veeramachaneni

• Yod Watanaprakornku

• Yee Lok Wong

• Kevin Wu

• Minshu Zhan

• Qin Zhao

Introduction PetaBricks OpenTuner Conclusions

Thanks!

About me:http://jasonansel.com/

http://opentuner.org/

http://projects.csail.mit.edu/petabricks/