torsten hoefler an overview of static & dynamic techniques … · 2018. 1. 27. ·...

67
spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic Performance Modeling All images belong to their creator! in collaboration with Alexandru Calotoiu and Felix Wolf @ RWTH Aachen with students Arnamoy Bhattacharyya and Grzegorz Kwasniewski @ SPCL presented at ISC 2016, Frankfurt, July 2016

Upload: others

Post on 17-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

TORSTEN HOEFLER

An Overview of Static & Dynamic Techniques

for Automatic Performance Modeling

All images belong to their creator!

in collaboration with Alexandru Calotoiu and Felix Wolf @ RWTH Aachen

with students Arnamoy Bhattacharyya and Grzegorz Kwasniewski @ SPCL

presented at ISC 2016, Frankfurt, July 2016

Page 2: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Original findings:

If carefully tuned, NBC speeds up

a 3D solver

Full code published

8003 domain – 4 GB array

1 process per node, 8-96 nodes

Opteron 246 (old even in 2006,

retired now)

Super-linear speedup for 96 nodes

~5% better than linear

9 years later: attempt to

reproduce !

System A: 28 quad-core nodes,

Xeon E5520

System B: 4 nodes, dual

Opteron 6274“Neither the experiment in A nor the one in B could

reproduce the results presented in the original paper,

where the usage of the NBC library resulted in a

performance gain for practically all node counts, reaching

a superlinear speedup for 96 cores (explained as being

due to cache effects in the inner part of the matrix vector

product).” 2

(2006)

(2015)

A

B1 node

(system

B)

My sinful youth

Page 3: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Assume you want to get into the top500

Run HPL until you get what you hope for – 77 Tflop/s

3

How to report a performance result?

@SC’15

Page 4: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Scalability bug prediction

Find latent scalability bugs early on (before machine deployment)SC13: A. Calotoiu, TH, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes

Automated performance testing

Performance modeling as part of a software engineering discipline in HPCICS’15: S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations?

Hardware/Software co-design

Decide how to architect systems

Making performance development intuitive

4

Analytical application performance modeling

vs.

Page 5: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Disadvantages

• Time consuming

• Error-prone, may overlook unscalable code

5

Manual analytical performance modeling

Identify kernels

• Parts of the program that dominate its performance at larger scales

• Identified via small-scale tests and intuition

Create models

• Laborious process

• Still confined to a small community of skilled experts

TH, W. Gropp, M. Snir, and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC11

Page 6: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

p4 = 1,024

p5 = 2,048

p6 = 4,096

6

Our first step: scalability bug detector

main() {

foo()

bar()

compute()

}

Instrumentation

Performance measurements (profiles)

Input

Output

1. foo

2. compute

3. main

4. bar

[…]

Ranking:

1. Asymptotic

2. Target scale pt

p1 = 128

p2 = 256

p3 = 512

Automated

modeling

• All functions

Weak s

calin

g

Page 7: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

7

Primary focus on scaling trend

Our ranking

1. F1

2. F3

3. F2

Common performance

analysis chart in a

not-so-great paper

Page 8: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

8

Primary focus on scaling trend

Our ranking

Actual measurement in

laboratory conditions 1. F1

2. F3

3. F2

Page 9: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

9

Primary focus on scaling trend

Our ranking

Production Reality1. F1

2. F3

3. F2

Page 10: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

10

Survey result: performance model normal form

f (p) = ck × pik × log2

jk (p)k=1

n

ån Î

ik Î I

jk Î J

I, J Ì

n =1

I = 0,1, 2{ }

J = {0,1}

c1

c1 × p

c1 × p2

c1 × log(p)

c1 × p × log(p)

c1 × p2 × log(p)

A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13

Page 11: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13

n Î

ik Î I

jk Î J

I, J Ì

11

Survey result: performance model normal form

f (p) = ck × pik × log2

jk (p)k=1

n

å

n = 2

I = 0,1, 2{ }

J = {0,1}

c1 + c2 × p

c1 + c2 × p2

c1 + c2 × log(p)

c1 + c2 × p × log(p)

c1 + c2 × p2 × log(p)

)log(

)log()log(

)log(

)log(

)log(

)log()log(

)log(

)log()log(

)log(

2

2

2

1

2

21

2

21

2

21

2

21

21

2

21

2

21

21

21

ppcpc

ppcppc

pcppc

ppcpc

pcpc

ppcpc

ppcpc

pcpc

ppcpc

pcpc

Page 12: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

12

Our automated generation workflow

Performance

measurements

Performance

profiles

Model

generation

Scaling

models

Performance

extrapolation

Ranking of

kernels

Statistical

quality assurance

Model

generation

Accuracy

saturated?

Model

refinementScaling

models

Yes

NoKernel

refinement

A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13

Page 13: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

13

Model refinement

Hypothesis generation;

hypothesis size n

Scaling model

Input data

Hypothesis evaluation

via cross-validation

Computation of

for best hypothesis

No

Yes

n =1;R0

2

= -¥

Rn2

n++Rn-1

2

> Rn2

Ú

n = nmax

{(p1, t1),..., (p6, t6 )}

c1 × log(p)

c1 × p × log(p)

c1 × p2 × log(p)

c1

c1 × p

c1 × p2

16)1(1

1

22

2

n

nRR

uarestotalSumSq

mSquaresresidualSuR

I = {0,1,2};J = {0,1};nmax = 2

c1 × log(p)

Page 14: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

We face several problems:

Multiparameter modeling – search space explosion

Interesting instance of the curse of dimensionality

Modeling overheads

Cross validation (leave-one-out) is slow and

Our current profiling requires a lot of storage (>TBs)

14

Is this all? No, it’s just the beginning …

Page 15: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Structures that determine program scalability

LOOPS

Assumption:

Other instructions do not influence it

Example:

for (x=0; x < n/p; x++)

for (y=1; y < n; y=2*y )

veryComplicatedOperation(x,y);

15

Static analysis of explicitly parallel programs

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 16: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Affine loops

Perfectly nested affine loops

16

Counting arbitrary affine loop nests

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 17: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

17

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 18: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

18

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 19: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

19

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 20: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

20

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 21: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

21

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 22: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

22

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 23: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Example

23

Counting arbitrary affine loop nests

for (j=1; j < n/p + 1; j= j*2)

for (k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

k

jxwhere

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 24: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

24

Overview of the whole system

with

Loop extractionAffine loop synthesisClosed form representation

Number of iterations

rkxni

iibxiiAiix

kkr

rfinalrfinalr

...1),(...0

),...,(),...,(),...,(

,0

1011

p

p

ND

NW1

LLVMParallel program

Program analysis

Page 25: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Well, what about non-affine loops?

More general abstract interpretation (next step)

Not solvable → will always have undefined terms

Back to PMNF?

Generalize to multiple input parameters

a) Bigger search-space

b) Bigger trace files

Ad-hoc (partial) solution: online machine learning – PEMOGEN

Replace cross-validation with LASSO (regression with L1 regularizer)

Much cheaper!

Replace LASSO with online LASSO [1]

No traces!

25

What problems are remaining?

[1]: P. Garrigues and L. El Ghaoui. An homotopy algorithm for the Lasso with online observations. NIPS 2008

Page 26: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Also integrated into LLVM compiler

Automatic kernel detection and instrumentation (Loop Call Graph)

Static dataflow analysis reduces parameter space for each kernel

26

PEMOGEN – static analysis

Quality: NAS UA and Mantevo MiniFE Overhead: Mantevo

A. Bhattacharyya, T. Hoefler: PEMOGEN: Automatic Adaptive Performance Modeling during Program Runtime, PACT’14

Page 27: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Use-case A: automatic testing (Allreduce time)

Divergence on Piz

Daint is O(p0.67),

the highest of all

three

S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations?, ICS’15

Page 28: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Use-case B: automatic testing (MPI memory size)

Linear memory

consumption on

Juropa

ParaStation MPI

uses RC over IB

S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations?, ICS’15

Page 29: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

29

Page 30: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

30

Performance Analysis 2.0 – Automatic Models

• Is feasible

Still a long way to go …

• Offers insight

• Requires low effort

• Improves code coverage

A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance

Modeling to Find Scalability Bugs in Complex Codes. Supercomputing (SC13).

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel

Programs. SPAA 2014.

A. Bhattacharyya, T. Hoefler: PEMOGEN: Automatic Adaptive Performance

Modeling during Program Runtime, PACT 2014

S. Shudler, A. Calotoiu, T. Hoefler, A. Strube, F. Wolf: Exascaling Your Library:

Will Your Implementation Meet Your Expectations? ICS 2015

Page 31: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

31

Backup

Page 32: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

32

How to mechanize the expert? → Survey! C

om

puta

tio

nC

om

mu

nic

atio

n

Samplesort

t(p) ~ p2

Naïve N-body

t(p) ~ p

FFT)(log~)( 2 ppt

LU

t(p) ~ c

Samplesort

t(p) ~ p2 log2

2(p)

Naïve N-body

t(p) ~ p

FFT)(log~)( 2 ppt

LU

t(p) ~ c

… …

Page 33: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

33

Page 34: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

34

I = {02, 1

2, 2

2, 3

2, 4

2, 5

2, 6

2}

J = {0,1,2}

n = 5

✔✔

Sweep3D

MILC

HOMME

XNS

Performance

measurements

Performa

nce

profiles

Model

generation

Scaling

models

Performance

extrapolation

Ranking

of kernels

Statistical

quality assurance

Model

generation

Accuracy

saturated?

Model

refinementScaling

models

Yes

No

Kernel

refinement

Evaluation overview

Page 35: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Solves neutron transport

problem

3D domain mapped onto

2D process grid

Parallelism achieved through

pipelined wave-front process

LogGP model for communication developed by Hoisie et al.

We assume p=px*py → Equation (6) in [1]

35

Sweep3D communication performance

pct comm

[1] A. Hoisie, O. M. Lubeck, and H. J. Wasserman. Performance analysis of wavefront algorithms on very-large scale distributed

systems. In Workshop on Wide Area Networks and High Performance Computing, pages 171–187. Springer-Verlag, 1999.

Page 36: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

36

Sweep3D communication performance

Kernel

[2 of 40]

Runtime[%]

pt=262k

Model [s]

t = f(p)

Predictive error [%]

pt=262k

sweep → MPI_Recv

sweep

65.35

20.87

5.10

0.01

4.03 p

582.19 #bytes = const.

#msg = const.

pi £ 8k

Page 37: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

MILC/su3_rmd – from MILC suite of QCD

codes with performance model manually

created

• Time per process should remain constant

except for a rather small logarithmic term

caused by global convergence checks

37

MILC

Kernel

[3 of 479]

Model [s]

t=f(p)

Predictive

Error [%]

pt=64k

compute_gen_staple_field

g_vecdoublesum → MPI_Allreduce

mult_adj_su3_fieldlink_lathwec

2.40 ×10-2

6.30 ×10-6 × log2

2(p)

3.80 ×10-3

0.43

0.01

0.04

pi £16k

Page 38: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Core of the Community Atmospheric

Model (CAM)

Spectral element dynamical core

on a cubed sphere grid

38

HOMME

Kernel

[3 of 194]

Model [s]

t = f(p)

Predictive error [%]

pt = 130k

box_rearrange →

MPI_Reduce

vlaplace_sphere_vk

compute_and_apply_rhs

0.026+2.53×10-6p × p+ 1.24 ×10-12p3

49.53

48.68

57.02

99.32

1.65

pi £15k

Page 39: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Core of the Community Atmospheric

Model (CAM)

Spectral element dynamical core

on a cubed sphere grid

39

HOMME (2)

Kernel

[3 of 194]

Model [s]

t = f(p)

Predictive error [%]

pt = 130k

box_rearrange →

MPI_Reduce

vlaplace_sphere_vk

compute_and_apply_rhs

3.63×10-6p × p+ 7.21×10-13p3

pi £ 43k

30.34

4.28

0.83

24.44+2.26 ×10-7p2

49.09

Page 40: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

40

HOMME (3)

Page 41: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Wall-clock time not necessarily monotonically increasing –

harder to capture model automatically

• Different invariants require different reductions across processes

Superlinear speedup through cache effects

• Measure and model re-use distance?

41

What about strong scaling?

Weak scaling Strong scaling

Invariant Problem size per process Overall problem size

Model target Wall-clock time Accumulated time

Reduction Maximum / average Sum

Page 42: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Finite element flow simulation

program with numerous

equations represented:

• Advection diffusion

• Navier-Stokes

• Shallow water

Strong scaling analysis

• P = {128; …; 4,096}

• 5 measurements per pi

• Using accumulated time across processes as metric

42

XNS

Page 43: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

43

XNS (2)

KernelRuntime[%]

p=128

Runtime[%]

p=4,096

Model [s]

t = f(p)

ewdgennprm->MPI_Recv

ewddot

51.46

5.04

0.029 × p2

37406.80+13.29 × p × log(p)

0.46

44.78

Accumulated time Wallclock time

#bytes = ~p

#msg = ~p

Page 44: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

44

Step back – what do we really care about?

1TW

TD

p

ppT

TE 1

start

end

Depth

Parallel efficiency

Work

Page 45: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Polyhedral model

45

Related work: counting loop iterations

piplib

PolyLib

PPL

Polly

R. Karp, R. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. J. ACM, 14(3):563–590, July 1967.

Page 46: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Polyhedral model

46

Related work: counting loop iterations

for (j = 1; j <= n; j = j*2)

for (k = j; k <= n; k = k++)

veryComplicatedOperation(j,k);

2

)1(

,

,1

nnN

njk

nj

2log)1( 2 nnnN

A.I. Barvinok. A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed, Math. Oper. Res., 1994

Page 47: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

When the polyhedral model cannot handle it

47

Related work: counting loop iterations

j=10;

k=10;

while (j>0){

j=j+k;

k--;

}

?

Page 48: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Closed form representation of a loop

Single affine statement

Counting function

48

Algorithm in details

pLxx

)( 0xn

)(

;0

gxcwhile

xx

T

;bAxx

1

0

00 ),(i

j

ji bAxAxix

)),((minarg),,(

)()(),(

00

00

gxixcgcxn

ipxiLxix

T

i

;0

0

11

01

xx

0

0

1

01

0

0

11

01

11

01),( 0

1

0

00 xi

xxixi

j

ji

}

){10(

;0

0

01

01

mxwhile

xx

0

00 )(

j

kmxn

Example

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 49: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Folding the loops

49

Algorithm in details

;0

0

11

01

xx

){01(

;0

1

10

00

pnxwhile

xx

}

){10(

;0

0

01

01

mxwhile

xx

;0

0

10

02}

xx

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 50: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Folding the loops

50

Algorithm in details

;0

0

11

01

xx

){01(

;0

1

10

00

pnxwhile

xx

}

){10(

;0

0

01

01

mxwhile

xx

;0

0

10

02}

xx

){01(

;0

1

10

00

pnxwhile

xx

}

;0

0

01

01

xx

;0

0

10

02

xx

;0

0

1

01

x

ix

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 51: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Folding the loops

51

Algorithm in details

;0

0

11

01

xx

){01(

;0

1

10

00

pnxwhile

xx

}

){10(

;0

0

01

01

mxwhile

xx

;0

0

10

02}

xx

){01(

;0

1

10

00

pnxwhile

xx

}

;0

0

01

01

xx

;0

0

1

01

x

ix

;0

0

10

02

xx

){01(

;0

1

10

00

pnxwhile

xx

}

;0

0

01

02

x

ix

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 52: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Starting conditions

52

Algorithm in details

){(

;

11

0

gxcwhile

xx

T

;33 bxAx

){(

;

22

11

gxcwhile

bxAx

T

;} 11 vxUx

}

){(

;

33

22

gxcwhile

bxAx

T

;} 22 vxUx

3,0x

2,0x

1,0x

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 53: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Counting the number of iterations

We have:

53

Algorithm in details

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 54: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Counting the number of iterations

We have:

The closed form for each loop:

• Single affine statement

• Counting function

Starting condition for each loop

54

Algorithm in details

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 55: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Counting the number of iterations

We have:

The closed form for each loop:

• Single affine statement

• Counting function

Starting condition for each loop

Number of iterations:

55

Algorithm in details

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 56: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Counting the number of iterations

The equation computes the precise number of iterations

56

Algorithm in details

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 57: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Counting the number of iterations

The equation gives precise number of iterations

But simplification may fail → Sum approximation

• Approximate sums by integrals

→ lower and upper bounds

57

Algorithm in details

T. Hoefler, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, SPAA’14

Page 58: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

58

Solving more general problems

Page 59: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Multipath loops

59

Solving more general problems

Page 60: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Multipath loops

Conditional statements

60

Solving more general problems

Page 61: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Multipath loops

Conditional statements

Non-affine loops

61

Solving more general problems

rowstr(j+1)-1-rowstr(j)=u

Page 62: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

NAS Parallel Benchmarks: EP

62

Case studies

Page 63: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

NAS Parallel Benchmarks: EP

63

Case studies

Page 64: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

64

Case studies

CG – conjugate gradient

pkp

mk

p

mkp

kE

TD

pkp

mk

p

mkW

p

2321

4

2321

log

log

IS – integer sort

32

1

2123

kp

mkpp

kE

TD

uupp

mtbnW

p

Page 65: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Why affine loops?

Closed form representation of the loop

65

Counting Arbitrary Affline Loop Nests

)),((minarg),,(

)()(),(

00

00

gxdxcgcxn

ipxiLxix

T

d

Page 66: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Why affine loops?

Closed form representation of the loop

Example

66

Counting Arbitrary Affline Loop Nests

)),((minarg),,(

)()(),(

00

00

gxdxcgcxn

ipxiLxix

T

d

;0

0

11

01

xx

0

0

1

01),( 00 x

ixix

}

){10(

;0

0

01

01

mxwhile

xx

0

00 )(

j

kmxn

for ( k=j; k < m; k = k + j )

veryComplicatedOperation(j,k);

0

0

01

01

0

0

0 xk

jxwhere

Page 67: TORSTEN HOEFLER An Overview of Static & Dynamic Techniques … · 2018. 1. 27. · spcl.inf.ethz.ch @spcl_eth TORSTEN HOEFLER An Overview of Static & Dynamic Techniques for Automatic

spcl.inf.ethz.ch

@spcl_eth

Multipath affine loops

67

Loops