
Tools for Engineering Analysis of High Performance Parallel Programs

David Culler, Frederick Wong, Alan Mainwaring

Computer Science Division

U.C. Berkeley

http://www.cs.berkeley.edu/~culler/talks


11/5/99 LLNL ASCI III 2

Traditional Parallel Programming Tools

• Focus on showing “what program did” and “when it did it”
– microscopic analysis of deterministic events
– oriented towards initial development of small programs on small data sets and small machines

• Instrumentation
– traces, counters, profiles

• Visualization

• Examples
– AIMS, PTOOLS, PPP
– Pablo + Paradyn + ... => Delphi
– ACTS TAU - tuning and analysis utilities


Example: Pablo


Beyond Zeroth-order Analysis

• Basic level: get to a system design that is reasonable and behaves properly under “ideal conditions”

• Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior

• Combine empirical data with analytical models

• Iterate

• From “What?” to “What if?”

[Figure: max displacement vs. wind speed]


Approach: Framework for Parameterized Sensitivity Analysis

• Framework performs analysis over numerous runs
– statistical filtering
– vary parameter of interest

• Provides means of combining data to isolate effects of interest

=> ROBUSTNESS

[Diagram components: Well-developed Parallel Program; Study Parameter; Problem Data Set Generator; Instrumentation Tools; Machine Characterizers; visualization, modeling]

• Procs

• Comm. perf.

• Cache

• Scheduling

• ...
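The framework idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the names `sweep` and `fake_run` are hypothetical, and using the median as the statistical filter is an assumption made for the example.

```python
import statistics
from typing import Callable, Dict, Iterable, List


def sweep(run_once: Callable[[int], float],
          values: Iterable[int],
          repeats: int = 5) -> Dict[int, dict]:
    """Run `run_once(value)` `repeats` times per parameter value and summarize."""
    results = {}
    for value in values:
        samples: List[float] = [run_once(value) for _ in range(repeats)]
        results[value] = {
            "samples": samples,
            # The median acts as a simple statistical filter against outliers.
            "median": statistics.median(samples),
            "min": min(samples),
            "max": max(samples),
        }
    return results


# Hypothetical stand-in for launching the real program and timing it:
# a perfectly scalable 100-second job plus a fixed 2-second overhead.
def fake_run(procs: int) -> float:
    return 100.0 / procs + 2.0


summary = sweep(fake_run, values=[1, 2, 4, 8], repeats=3)
for procs, stats in summary.items():
    print(procs, stats["median"])
```

In a real study, `run_once` would launch the instrumented program with the study parameter (processors, cache size, scheduling policy, and so on) set to `value` and return the measured quantity.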


Simplest Example: Performance( P )

• NPB2.2 on NOW and Origin 2000 (250)

[Figure: “Origin Speedup” and “Cluster Speedup” panels. Speedup vs. machine size (processors, 0 to 48); series: BT, SP, LU, MG, FT, IS, Ideal]
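As a rough sketch of how speedup curves like these can be derived from run times: the function below computes speedup and parallel efficiency relative to the smallest measured machine size. The timings are hypothetical, and treating the smallest run as perfectly efficient is an assumption of this example, not something the slides state.

```python
def speedup_table(times):
    """Map processor count -> (speedup, efficiency), relative to the smallest run.

    `times` maps processor count to run time in seconds. The smallest run is
    assumed to be perfectly efficient, so its speedup equals its processor count.
    """
    base_p = min(times)
    base_t = times[base_p]
    table = {}
    for p in sorted(times):
        s = base_p * base_t / times[p]
        table[p] = (s, s / p)
    return table


# Hypothetical timings in seconds.
times = {4: 1000.0, 8: 520.0, 16: 280.0, 32: 160.0}
table = speedup_table(times)
for p, (s, e) in table.items():
    print(f"P={p:2d}  speedup={s:6.2f}  efficiency={e:.2f}")
```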


Where Time is Spent ( P )

• Reveal basic Processor and network loading (vs P)

• Basis for model derivation - comm(P)

[Figure: “LU (Origin)” and “LU (Cluster)” panels. Time (seconds, 0 to 3,000) vs. machine size (processors, 0 to 40); series: Total, Comp, Comm, Ideal]


Where Time is Spent ( P ) - cont

• Reveal basic Processor and network loading (vs P)

[Figure: “FT (Cluster)” and “FT (Origin)” panels. Time (seconds, 0 to 350) vs. machine size (processors, 0 to 40); series: Total, Comp, Comm, Ideal]


Communication Volume ( P )

[Figure: “Total Communication Volume” panel, volume (GB, 0 to 120) vs. machine size (processors, 0 to 40); “Bytes Per Processor” panel, volume (MB, 0 to 6,000) vs. machine size (processors, 0 to 40); series: BT, SP, LU, MG, FT, IS]


Communication Structure ( P )

[Figure: “Normalized Messages Per Processor” panel, messages per processor (0 to 8) vs. machine size (processors, 0 to 40); “Average Message Size” panel, average message size (KB, 1 to 10,000, log scale) vs. machine size (processors, 0 to 40); series: BT, SP, LU, MG, FT, IS]


Understanding Efficiency ( P, M )

• Want to understand both what load the program is placing on the system

• and how well the system is handling that load

=> characterize the capability of the system via simple benchmarks (rather than advertised peaks)

=> combine with measured load for predictive model, & compare

[Figure: “MPI One-way Latency on Cluster” and “MPI One-way Latency on Origin” panels. Time (usec, 0 to 70) vs. message size (bytes, 1 to 1,000, log scale)]
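The "characterize, then combine with measured load" step can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: it assumes one-way latency is linear in message size (time = t0 + per_byte * size), and the function names and all data points are hypothetical.

```python
def fit_latency_model(sizes, times_us):
    """Least-squares fit of time = t0 + per_byte * size. Returns (t0, per_byte)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times_us) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times_us))
    per_byte = sxy / sxx
    t0 = mean_y - per_byte * mean_x
    return t0, per_byte


def predict_comm_time_us(messages, avg_size, t0, per_byte):
    """Predicted communication time for `messages` messages of `avg_size` bytes."""
    return messages * (t0 + per_byte * avg_size)


# Hypothetical microbenchmark samples: message size (bytes) -> one-way time (usec).
sizes = [0, 256, 512, 1024]
times_us = [10.0, 15.12, 20.24, 30.48]

t0, per_byte = fit_latency_model(sizes, times_us)

# Hypothetical measured load: 1000 messages averaging 512 bytes each.
predicted = predict_comm_time_us(1000, 512, t0, per_byte)
measured = 25300.0  # hypothetical measured communication time (usec)
print(f"t0={t0:.2f} usec, per_byte={per_byte:.4f} usec/B")
print(f"predicted/measured = {predicted / measured:.2f}")
```

Comparing the predicted time with the measured one gives a single number per run; whether that ratio matches the efficiency metric plotted in the talk is not stated in the slides.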


Communication Efficiency

[Figure: “Cluster (rendezvous)” panel (series: BT, SP, LU, MG, FT) and “Origin” panel (series: BT, SP, LU, MG, FT, IS). Efficiency (%, 0 to 100) vs. machine size (processors, 0 to 40)]


Tools => Improvements in Run Time

• Efficiency analysis (vs parameters) gives insight into where to improve the system or the program
– use traditional profiling to see where in the program the ‘bad stuff’ happens
– or go back and tune the system to do better

[Figure: “Cluster (eager)” panel (series: BT, SP, LU, MG, FT, IS) and “Cluster (rendezvous)” panel (series: BT, SP, LU, MG, FT). Efficiency (%, 0 to 100) vs. machine size (processors, 0 to 40)]


Cache Behavior (P, $)

• Combining trace generation with simulation provides new structural insight

• Here: clear knees in program working set ($); these shift with machine size (P)

[Figure: four “LU” panels. Miss rate (%, 0 to 14) vs. per-processor cache size (KB, 1 to 4096); curves labeled 4, 8, 16, 32 (first panel), 4, 8, 16 (second), 4, 8 (third), and 4 (fourth)]
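To illustrate how feeding a trace into a cache simulator exposes working-set knees, here is a minimal sketch. The slides do not describe the simulator used; a fully associative LRU cache, a trace of cache-line addresses, and the synthetic trace itself are all assumptions of this example.

```python
from collections import OrderedDict


def miss_rate(trace, cache_lines):
    """Miss rate of a fully associative LRU cache holding `cache_lines` lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict the least recently used line
    return misses / len(trace)


# Synthetic trace: 100 sweeps over a 64-line working set.
trace = [i % 64 for i in range(64 * 100)]
for size in (16, 32, 64, 128):
    print(size, miss_rate(trace, size))
```

With this trace the miss rate stays at 100% until the cache holds the whole 64-line working set, then drops to the cold-miss floor: a knee at 64 lines. Repeating the sweep with traces from different machine sizes would show whether the knee moves with P.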


Cache Behavior (P, $)

• Clear knees in program working set ($) not affected by P

[Figure: “FT” panel. Miss rate (%, 0 to 25) vs. per-processor cache size (KB, 1 to 4096); curves labeled 4, 8, 16, 32]


Sensitivity to Multiprogramming

• Parallel machines are increasingly general purpose
– multiprogramming, at least interrupts and daemons

• Many ‘ideal’ programs very sensitive to perturbations
– Msg Passing is loosely coupled, but implementation may not be!

[Figure: two bar charts of slowdown (axis 0 to 22) for LU, FT, MG]

Slowdown       LU      FT      MG
Dedicated      1       1       1
1-Seq          6.39    1.43    4.11
2-Seq          19.05   1.63    5.86
3-Seq          20.25   1.65    6.53

Slowdown       LU      FT      MG
Dedicated      1       1       1
2-PP           4.20    1.28    4.18
3-PP           18.24   1.51    6.27


Tools => Improvements in Run Time

• MPI implementation spin-waits on send till network available (or queue not full) or on recv-complete

• Should use two-phase spin-block

[Figure: bar chart of slowdown (axis 0.0 to 1.4) for LU, FT, MG]

Slowdown       LU      FT      MG
Dedicated      1       1       1
2-PP           1.24    0.96    1.16
3-PP           1.31    0.91    1.20
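The two-phase spin-block idea can be sketched as follows: spin for a bounded time so that short waits stay cheap, then block so the processor is released to other jobs. This is an illustration with Python threads only; a real MPI library would do this in its own runtime on network queue state. The function name and the 1 ms spin budget are assumptions.

```python
import threading
import time


def two_phase_wait(event: threading.Event, spin_seconds: float = 0.001) -> str:
    """Spin briefly on `event`, then block if it has not been set.

    Returns "spin" or "block" to show which phase observed the event.
    """
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:   # phase 1: busy-wait (low latency)
        if event.is_set():
            return "spin"
    event.wait()                            # phase 2: give up the CPU until signaled
    return "block"


ready = threading.Event()
ready.set()
print(two_phase_wait(ready))      # event already set: seen during the spin phase

late = threading.Event()
threading.Timer(0.05, late.set).start()
print(two_phase_wait(late))       # set well after the spin budget: the wait blocks
```

The spin budget is the tuning knob: a common rule of thumb is to spin for roughly the cost of a context switch, but the slides do not say what value was used.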


Sensitivity to Seemingly Unrelated Activity

• The mechanism for doing parameter studies is naturally extended to get statistically valid data through multiple samples at each point
– tend to get crisp, fast results in the wee hours

• Extend study outside the app

• Example: two programs on big Origin

                            alone        together on 64 P
– 8 processor IS run:       4.71 sec     6.18
– 36 processor SP run:      26.36 sec    65.28
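The interference in this example is easier to see as a ratio. The short calculation below uses the four numbers from the slide, reading the first column as the run time alone and the second as the run time when both programs share the 64-processor machine, and assuming the second column is also in seconds.

```python
# (alone, together) run times in seconds, taken from the example on this slide.
runs = {
    "IS, 8 processors": (4.71, 6.18),
    "SP, 36 processors": (26.36, 65.28),
}

slowdowns = {name: together / alone for name, (alone, together) in runs.items()}
for name, factor in slowdowns.items():
    print(f"{name}: {factor:.2f}x slower when sharing the machine")
```

Under that reading, IS slows down by about 1.31x and SP by about 2.48x, even though the two jobs together use only 44 of the 64 processors.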


Repeatability

• The variance for the repeated runs is a key result for production codes - the real world is not ideal

[Figure: “Scatter Plot of FT Runtime on Origin (30 samples)”, time (seconds, 0 to 350) vs. machine size (processors, 0 to 32), with Average; “Scatter Plot of LU Runtime on Origin (30 samples)”, time (seconds, 0 to 1400) vs. machine size (processors, 0 to 32), with Average]
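One simple way to report repeatability alongside the average is the coefficient of variation of the repeated runs at each machine size. The sketch below shows the calculation; the sample values are hypothetical, not data from the talk.

```python
import statistics


def run_variability(samples):
    """Return mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical run times (seconds) for repeated runs at one machine size.
samples = [98.0, 100.0, 102.0, 100.0, 100.0]
stats = run_variability(samples)
print(f"mean={stats['mean']:.1f}s  stdev={stats['stdev']:.2f}s  cv={stats['cv']:.1%}")
```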


Plans

• Integrate our instrumentation and analysis tools with ACTS TAU
– port to UCB Millennium environment
– experiment with ASCI platforms

• Refine and complete the automated sensitivity analysis framework

• Backend performance data storage
– Pablo SPPF?

• Next Year
– integrate performance model development, prediction