A New Optimization Technique for the Inspector-Executor Method


Page 1: A New Optimization Technique for the Inspector-Executor Method


A New Optimization Technique for the Inspector-Executor Method

Daisuke Yokota†

Shigeru Chiba‡

Kozo Itano†

† University of Tsukuba
‡ Tokyo Institute of Technology

Page 2: A New Optimization Technique for the Inspector-Executor Method

Computer Simulation is Expensive

Physicists run simulations on a parallel computer at our campus every day.
– Our target parallel computer costs $45,000 every month.
  That is about $1 per minute ($45,000 / (30 × 24 × 60 min) ≈ $1.04), comparable to an international phone call between Japan and Canada.
– The programs run for a very long time: a week or more.

Page 3: A New Optimization Technique for the Inspector-Executor Method

Hardware for Fast Inter-node Communication

– Our computer, the SR2201, has such hardware for avoiding communication bottlenecks.

It should be used, but it is not in the real…
– At least, not at our computer center.
– It is not used by the compiler.
  It is difficult to generate optimized code for that hardware.
– It is not used by programmers.
  The programmers are not computer engineers but physicists.

Page 4: A New Optimization Technique for the Inspector-Executor Method

Our HPF Compiler

Optimization for
– Utilizing hardware for inter-node communication

Technique
– The Inspector-Executor method plus Static Code Optimization
– Compilation is executed in parallel

Target
– Hitachi SR2201

Page 5: A New Optimization Technique for the Inspector-Executor Method

Optimizations

Reducing the amount of exchanged data
– Our compiler allocates loop iterations to appropriate nodes to minimize communication

Merging multiple messages
– Our target computer provides hardware support
– Our compiler tries to use that hardware

Reusing TCW
– Another form of hardware support
– Reduces the setup time for each message send
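The first optimization above, allocating loop iterations to nodes, can be illustrated with a small sketch. It is not the compiler's actual algorithm: it assumes a BLOCK distribution, assigns each iteration greedily to the node that owns most of the elements it touches, and ignores load balance. The names `owner` and `assign_iterations` are hypothetical.

```python
from collections import Counter

def owner(index, block_size):
    """Node that owns a given array element under an assumed BLOCK distribution."""
    return index // block_size

def assign_iterations(accesses, block_size):
    """Assign each iteration to the node owning most of the elements it touches.

    accesses[i] is the list of array indices touched by iteration i.
    Returns (assignment, remote), where remote counts the elements that
    must be fetched from other nodes under this assignment.
    """
    assignment = []
    remote = 0
    for touched in accesses:
        counts = Counter(owner(ix, block_size) for ix in touched)
        # Pick the node with the most local elements; ties go to the lowest node number.
        best = min(counts, key=lambda n: (-counts[n], n))
        assignment.append(best)
        remote += len(touched) - counts[best]
    return assignment, remote

accesses = [[0, 1], [1, 2], [3, 4], [4, 5, 6], [7, 0]]
assignment, remote = assign_iterations(accesses, block_size=4)
print(assignment, remote)  # prints [0, 0, 0, 1, 0] 2
```

In this toy input, only the two iterations that straddle the block boundary need a remote element, whichever node they run on.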

Page 6: A New Optimization Technique for the Inspector-Executor Method

Merging Multiple Messages

Hardware support:
– Block-Stride Communication
– Multiple messages are sent as a single message
  (Data must be stored at regular intervals)

[Figure: Sender and Receiver]
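The "regular intervals" condition can be made concrete with a small sketch. It assumes the elements to send are given as a sorted list of distinct offsets and checks whether they form equal-length blocks at a constant stride. The descriptor layout `(start, block_len, stride, count)` and the function name are illustrative assumptions, not the SR2201 interface.

```python
def merge_block_stride(offsets):
    """Describe sorted, distinct offsets as one block-stride transfer if possible.

    Returns (start, block_len, stride, count) when the offsets form
    equal-length contiguous blocks at a regular stride, otherwise None.
    """
    if not offsets:
        return None
    # Split the offsets into contiguous runs, stored as [start, length].
    runs = [[offsets[0], 1]]
    for prev, cur in zip(offsets, offsets[1:]):
        if cur == prev + 1:
            runs[-1][1] += 1
        else:
            runs.append([cur, 1])
    starts = [start for start, _ in runs]
    lengths = {length for _, length in runs}
    if len(lengths) != 1:
        return None  # blocks of different sizes cannot share one descriptor
    if len(runs) == 1:
        return (starts[0], runs[0][1], 0, 1)
    strides = {b - a for a, b in zip(starts, starts[1:])}
    if len(strides) != 1:
        return None  # blocks are not at regular intervals
    return (starts[0], lengths.pop(), strides.pop(), len(runs))

print(merge_block_stride([0, 1, 8, 9, 16, 17]))  # prints (0, 2, 8, 3)
print(merge_block_stride([0, 1, 8, 9, 20, 21]))  # prints None
```

Three two-element messages collapse into a single descriptor in the first call; in the second, the irregular gap prevents merging.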

Page 7: A New Optimization Technique for the Inspector-Executor Method

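Reusing TCW

TCW: Transfer Control Word
Reusing the parameters passed to the communication hardware

[Figure: before optimization, both "setting" and "send" are performed inside the loop "do I=1,… / end do"; after optimization, "setting" is performed once outside the loop and only "send" remains inside]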
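The effect of reusing the TCW can be sketched with a stand-in for the communication hardware that only counts calls. `FakeNic`, `set_tcw`, and `send` are hypothetical names for illustration; they do not reflect the real SR2201 programming interface.

```python
class FakeNic:
    """Hypothetical stand-in for the communication hardware; it only counts calls."""

    def __init__(self):
        self.setups = 0
        self.sends = 0
        self.tcw = None

    def set_tcw(self, dest, offset, length):
        self.setups += 1
        self.tcw = (dest, offset, length)

    def send(self):
        assert self.tcw is not None, "TCW must be set before sending"
        self.sends += 1

def run_before(nic, iterations):
    """Before optimization: the same parameters are set again in every iteration."""
    for _ in range(iterations):
        nic.set_tcw(dest=1, offset=0, length=64)
        nic.send()

def run_after(nic, iterations):
    """After optimization: the setting is hoisted out of the loop and reused."""
    nic.set_tcw(dest=1, offset=0, length=64)
    for _ in range(iterations):
        nic.send()

before, after = FakeNic(), FakeNic()
run_before(before, 1000)
run_after(after, 1000)
print(before.setups, after.setups)  # prints 1000 1
```

Both versions send the same number of messages; only the number of setup operations changes.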

Page 8: A New Optimization Technique for the Inspector-Executor Method

Implementation: Original Inspector-Executor Method

Goal: Parallelize a loop by runtime analysis
Inspector runs at runtime

Inspector
– Determines which array elements must be exchanged among nodes

Executor
1. Exchanges array elements
2. Executes a loop body in parallel
3. Exchanges array elements

Inspector → Executor: resulting data of the analysis
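A minimal sketch of the original scheme, assuming a BLOCK-distributed array `x` and a loop body of the form y(i) = 2 * x(idx(i)). Nodes are simulated in a single process and "communication" is a plain copy; the final write-back exchange (step 3) is omitted because this loop only reads remote data. All names here are illustrative.

```python
def inspector(index_lists, block_size, num_nodes):
    """For each node, list the remote elements its iterations will read.

    index_lists[node] holds the indirection indices used by that node.
    Returns schedule[node] = sorted list of (owner, global_index) pairs.
    """
    schedule = []
    for node in range(num_nodes):
        needed = {(ix // block_size, ix) for ix in index_lists[node]
                  if ix // block_size != node}
        schedule.append(sorted(needed))
    return schedule

def executor(local_data, index_lists, schedule, block_size):
    """Exchange the scheduled elements, then run the loop body on each node."""
    results = []
    for node, indices in enumerate(index_lists):
        # Step 1: simulated exchange, copying each element from its owner's block.
        ghost = {ix: local_data[own][ix - own * block_size]
                 for own, ix in schedule[node]}
        # Step 2: loop body, reading either local data or fetched copies.
        out = []
        for ix in indices:
            if ix in ghost:
                out.append(2 * ghost[ix])
            else:
                out.append(2 * local_data[node][ix - node * block_size])
        results.append(out)
    return results

x = [[10, 11, 12, 13], [14, 15, 16, 17]]  # global x(0..7), BLOCK over 2 nodes
idx = [[0, 5, 3], [4, 1, 7]]              # indirection indices per node
schedule = inspector(idx, block_size=4, num_nodes=2)
print(schedule)                                  # prints [[(1, 5)], [(0, 1)]]
print(executor(x, idx, schedule, block_size=4))  # prints [[20, 30, 26], [28, 22, 34]]
```

Here the schedule is data that the executor consults at run time, which is the property the improved method changes.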

Page 9: A New Optimization Technique for the Inspector-Executor Method

Our Improved Inspector-Executor Method

Inspector produces statically optimized code of the executor.
– Inspector runs off-line.
– Running Inspector is part of the compilation process.

Inspector → Executor: optimized executor code, not data!

Page 10: A New Optimization Technique for the Inspector-Executor Method

Static Code Optimization

Inspector performs constant folding
– When generating the executor code

Constant folding eliminates from Executor:
– A table containing the result of the analysis by Inspector
  Saves memory space (the table size is big!)
– Memory accesses for table lookup
  Better performance
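The idea of emitting code instead of a table can be sketched as follows. The example assumes the same kind of loop body, y(i) = 2 * x(idx(i)), and a BLOCK distribution; the generator writes Python source in which every index is already a literal, so the generated executor needs no index table. The real compiler emits SPMD Fortran; the function name `generate_executor` and the generated code shape are assumptions for illustration.

```python
def generate_executor(node, indices, block_size):
    """Emit source for one node's executor with all indices folded in as constants."""
    lines = [f"def executor_node{node}(local, ghost):", "    return ["]
    for ix in indices:
        if ix // block_size == node:
            # Local element: the local offset is computed now, at generation time.
            lines.append(f"        2 * local[{ix - node * block_size}],")
        else:
            # Remote element: refers to the copy received from its owner.
            lines.append(f"        2 * ghost[{ix}],")
    lines.append("    ]")
    return "\n".join(lines)

source = generate_executor(node=0, indices=[0, 5, 3], block_size=4)
print(source)

namespace = {}
exec(source, namespace)  # "compile" the generated executor
result = namespace["executor_node0"]([10, 11, 12, 13], {5: 15})
print(result)  # prints [20, 30, 26]
```

The generated function contains no loop over an index table and no owner computation; both were resolved when the code was generated.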

Page 11: A New Optimization Technique for the Inspector-Executor Method

OUTER directive

Specifies the range of analysis by Inspector.
– OUTER loop
– We assume that the program structure fits the structure of typical simulation programs.

[Figure labels: OUTER Loop (this repeats millions of times during the simulation); INNER Loop (this is parallelized); Executor]

Page 12: A New Optimization Technique for the Inspector-Executor Method

Restrictions

Programmers must guarantee …
– Every iteration of the OUTER loop needs to exchange the same set of array elements among nodes.
  Since Inspector analyzes only the first iteration
– The set of exchanged array elements is determined without executing inter-node communication.
  Inspector does not perform the communication, in order to reduce the compilation time
  Our compiler cannot compile IS of the NAS Parallel Benchmarks
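The first restriction can be illustrated with a small check that a programmer might run on a trace of index values. It assumes a BLOCK distribution and compares the set of remote elements needed in each OUTER iteration with that of the first iteration. The function names are hypothetical, and the compiler itself is not described as performing such a check.

```python
def exchange_set(indices, node, block_size):
    """Remote elements, as (owner, index) pairs, that `node` needs for these indices."""
    return frozenset((ix // block_size, ix)
                     for ix in indices if ix // block_size != node)

def pattern_is_invariant(indices_per_outer_iteration, node, block_size):
    """True if every OUTER iteration needs the same remote elements as the first."""
    sets = [exchange_set(ixs, node, block_size)
            for ixs in indices_per_outer_iteration]
    return all(s == sets[0] for s in sets)

fixed = [[0, 5, 3], [3, 5, 0], [5, 0]]  # remote element 5 in every iteration
changing = [[0, 5, 3], [0, 6, 3]]       # remote element differs between iterations
print(pattern_is_invariant(fixed, node=0, block_size=4))     # prints True
print(pattern_is_invariant(changing, node=0, block_size=4))  # prints False
```

A loop like `changing`, whose remote accesses vary from one OUTER iteration to the next, would violate the guarantee, since only the first iteration is analyzed.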

Page 13: A New Optimization Technique for the Inspector-Executor Method

Our Compiler Runs on a PC Cluster

For executing Inspector in parallel.
– Inspector must analyze a large amount of data.
– In the original inspector-executor method, Inspector runs in parallel.
  Our Inspector is part of the compiler.

Page 14: A New Optimization Technique for the Inspector-Executor Method

Execution Flow of Our Compiler

[Figure: Source Program → Translate into SPMD → on each node in parallel: Generate Inspector → Inspector Log → Analysis → Code Generation → SPMD Parallel code; the nodes exchange information of messages]

Page 15: A New Optimization Technique for the Inspector-Executor Method

Our Prototype Compiler

Fortran77 + HPF + OUTER directive
– Output: SPMD Fortran code

Target machine
– Compilation: Pentium III 733MHz x 16 nodes, RedHat 7.1, 100Base Ethernet
– Execution: Hitachi SR2201, PowerPC-based 150MHz x 16 nodes

Page 16: A New Optimization Technique for the Inspector-Executor Method

Experiments: Pde1 benchmark

Poisson Equation
Good for massively parallel computing
– Regular array accesses
– High scalability
– Distributed array accesses are concentrated in a small region of source code

Page 17: A New Optimization Technique for the Inspector-Executor Method

Execution Time (pde1)

[Figure: speedup (0 to 20) versus number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and Linear; annotated values: 249 sec and 137,100 sec]

Hitachi's HPF compiler needs more directives for better performance

Page 18: A New Optimization Technique for the Inspector-Executor Method

Effects of Static Code Optimization (pde1)

[Figure: reduction of execution time (0% to 100%) versus number of nodes (1, 2, 4, 8, 16); legend: dynamic, static]

Page 19: A New Optimization Technique for the Inspector-Executor Method

Compilation Time (pde1)

[Figure: compilation time in seconds (0 to 250) versus number of nodes (2, 4, 8, 16); legend: backend Fortran, sequential, parallel, data exchange]

The long compilation time pays off if the OUTER loop iterates many times.
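The pay-off condition can be written as simple arithmetic: the extra compilation time is recovered once the accumulated per-iteration saving exceeds it. The sketch below uses hypothetical numbers, not measurements from these slides, and the function name is an assumption.

```python
import math

def break_even_iterations(extra_compile_sec, dynamic_iter_sec, static_iter_sec):
    """Smallest OUTER iteration count at which the static executor wins overall.

    extra_compile_sec: additional compilation time spent running Inspector.
    dynamic_iter_sec / static_iter_sec: time per OUTER iteration without and
    with static code optimization.
    """
    saving = dynamic_iter_sec - static_iter_sec
    if saving <= 0:
        return None  # no per-iteration saving, so the cost is never recovered
    return math.floor(extra_compile_sec / saving) + 1

# Hypothetical values: 200 s of extra compilation, 0.25 s saved per iteration.
print(break_even_iterations(200.0, 0.5, 0.25))  # prints 801
```

With a loop that repeats millions of times, a break-even point in the hundreds of iterations would be reached almost immediately.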

Page 20: A New Optimization Technique for the Inspector-Executor Method

Experiment: FT-a

3D Fourier Transformation
Features
– Irregular array accesses
– Distributed array accesses are concentrated in a small region of source code

Page 21: A New Optimization Technique for the Inspector-Executor Method

Execution Time (FT-a)

[Figure: speedup (0 to 20) versus number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and Linear; annotated values: 46 sec and 4,898 sec]

Page 22: A New Optimization Technique for the Inspector-Executor Method

Compilation Time (FT-a)

[Figure: compilation time in seconds (0 to 350) versus number of nodes (2, 4, 8, 16); legend: backend, sequential, parallel, data exchange]

Page 23: A New Optimization Technique for the Inspector-Executor Method

Experiments: BT-a

Block Tri-diagonal Solver
Features
– A small number of irregular array accesses
– Distributed array accesses are scattered all over the source code

Page 24: A New Optimization Technique for the Inspector-Executor Method

Execution Time (BT-a)

[Figure: speedup (0 to 20) versus number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and Linear; annotated values: 1,430 sec and 1,370,000 sec]

Page 25: A New Optimization Technique for the Inspector-Executor Method

Compilation Time (BT-a)

[Figure: compilation time in seconds (0 to 40000) versus number of nodes (2, 4, 8, 16); legend: backend, sequential, parallel, data exchange]

Inspector must analyze a huge number of array accesses

Our compiler cannot achieve good performance

Page 26: A New Optimization Technique for the Inspector-Executor Method

Conclusion

HPF compiler for utilizing hardware for inter-node communication
– Inspector-executor method
– Static code optimization
  Inspector produces optimized executor code
– Compiler runs on a PC cluster

Experiment
– Long compilation time is acceptable for simulation programs that run for a long time

Page 27: A New Optimization Technique for the Inspector-Executor Method

Backup slides

Page 28: A New Optimization Technique for the Inspector-Executor Method

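Reducing the Amount of Communication (Optimization)

Loop iterations are distributed so that the amount of communication becomes small
– Data partitioning is specified with HPF
– A preliminary run examines the amount of communication that would occur

[Figure: required amount of communication for each loop iteration on PE 1 and PE 2, and the processor assigned to each iteration]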

Page 29: A New Optimization Technique for the Inspector-Executor Method

Merging Multiple Messages

Our compiler collects several messages and sends them as a single message
– Messages in a loop with the INDEPENDENT directive can be merged
  This directive specifies that the result of the loop is independent of the execution order of its iterations

Our compiler finds block-stride communication by pattern matching, to reduce the number of communications

Page 30: A New Optimization Technique for the Inspector-Executor Method

Future Work

We want to reduce the number of communications further
– We want to use block-stride communication more aggressively
  (for example, when sending some redundant data would let messages be merged into a smaller number of communications)

Preventing the generated code from growing too long
– If the data dependencies between processors are too complex, our compiler generates too many communication operations

Improving the scalability of compilation time
– The Inspector log for BT was too large

Experiments with real simulations

Page 31: A New Optimization Technique for the Inspector-Executor Method

CP-PACS/Pilot-3

Distributed memory machine
– Center for Computational Physics at University of Tsukuba
– 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
– Hyper crossbar
– RDMA

Page 32: A New Optimization Technique for the Inspector-Executor Method

Our Optimizer to Solve the Problem

Use of special communication devices
– Parallel machines sometimes have special hardware to reduce the time for inter-node communication

Development of compilers for easy and well-known computer languages
– Fortran77, simple HPF (High Performance Fortran)

Runtime analysis
– Profiler for communication, running on a PC cluster

Page 33: A New Optimization Technique for the Inspector-Executor Method

Effects of Static Code Optimization (pde1)

[Figure: reduction of execution time (0% to 100%) versus number of nodes (2, 4, 8, 16); legend: dynamic, static]