A New Optimization Technique for the Inspector-Executor Method

Post on 05-Jan-2016


1

A New Optimization Technique for the Inspector-Executor Method

Daisuke Yokota†

Shigeru Chiba‡

Kozo Itano†

† University of Tsukuba   ‡ Tokyo Institute of Technology

2

Computer Simulation is Expensive

Physicists run a parallel computer at our campus every day for simulation.

– Our target parallel computer costs $45,000 every month (about $1 per minute, the rate of an international phone call between Japan and Canada).

– The programs run very long: a week or more.

3

Hardware for Fast Inter-node Communication

– Our computer, the SR2201, has such hardware for avoiding communication bottlenecks.

It should be used, but in reality it is not (at least at our computer center):

– It is not used by the compiler, because it is difficult to generate optimized code for that hardware.

– It is not used by programmers, because the programmers are not computer engineers but physicists.

4

Our HPF Compiler

Optimization for
– utilizing hardware for inter-node communication

Technique
– the inspector-executor method plus static code optimization
– compilation is executed in parallel

Target
– Hitachi SR2201

5

Optimizations

Reducing the amount of exchanged data
– Our compiler allocates loop iterations to appropriate nodes to minimize communication.

Merging multiple messages
– Our target computer provides hardware support.
– Our compiler tries to use that hardware.

Reusing the TCW
– another hardware support
– reduces the setup time for each message sent

6

Merging Multiple Messages

Hardware support: block-stride communication
– Multiple messages are sent as a single message.
– The data must be stored at regular intervals.

[Figure: a sender's regularly spaced data blocks packed into one message to the receiver.]
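The regular-interval requirement can be illustrated with a small sketch; the `merge_block_stride` helper and its `(start, stride, count)` descriptor are illustrative assumptions, not the SR2201's actual message format.

```python
# Sketch: a list of element offsets bound for one node collapses into a
# single (start, stride, count) block-stride descriptor when the offsets
# are spaced at a regular interval; otherwise they must be sent one by one.

def merge_block_stride(offsets):
    """Return (start, stride, count) if the offsets form one
    block-stride message, else None."""
    if len(offsets) < 2:
        return None
    stride = offsets[1] - offsets[0]
    for prev, cur in zip(offsets, offsets[1:]):
        if cur - prev != stride:
            return None          # irregular interval: cannot merge
    return (offsets[0], stride, len(offsets))

print(merge_block_stride([4, 8, 12, 16]))   # (4, 4, 4): one merged message
print(merge_block_stride([1, 2, 5]))        # None: irregular, no merge
```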

7

Reusing TCW

TCW: Transfer Control Word. Reusing the parameters passed to the communication hardware.

Before optimization:
  do I=1,…
    setting
    send
  end do

After optimization:
  setting
  do I=1,…
    send
  end do
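The before/after loops can be mimicked in a few lines; the `Channel` class is a hypothetical stand-in for the communication hardware, counting how often the expensive TCW setup runs.

```python
# Sketch of TCW reuse: set_tcw() models the expensive per-message setup,
# send() the cheap transfer that reuses the current TCW.

class Channel:
    def __init__(self):
        self.setups = 0
        self.sends = 0
        self.tcw = None
    def set_tcw(self, dest, addr, length):   # expensive setup
        self.setups += 1
        self.tcw = (dest, addr, length)
    def send(self):                          # cheap: reuses the stored TCW
        assert self.tcw is not None
        self.sends += 1

before = Channel()                 # before optimization: setup every time
for i in range(100):
    before.set_tcw(dest=1, addr=0, length=8)
    before.send()

after = Channel()                  # after optimization: setup hoisted out
after.set_tcw(dest=1, addr=0, length=8)
for i in range(100):
    after.send()

print(before.setups, after.setups)   # 100 1
```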

8

Implementation: the Original Inspector-Executor Method

Goal: parallelize a loop by runtime analysis. The inspector runs at runtime.

Inspector
– determines which array elements must be exchanged among nodes

Executor
1. exchanges array elements
2. executes the loop body in parallel
3. exchanges array elements

The inspector passes the resulting data of its analysis to the executor.
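As a concrete sketch of this division of labor (the block distribution, `owner` function, and `fetch` callback are illustrative assumptions), consider parallelizing the irregular loop `a(i) = b(idx(i))` over two nodes:

```python
# Inspector: scan the index array once and record which elements of b
# live on another node. Executor: fetch those elements, then run the loop.

N = 8
def owner(i):                       # block distribution over 2 nodes
    return 0 if i < N // 2 else 1

def inspector(idx, lo, hi):
    """Indices of remote b-elements needed by iterations lo..hi-1."""
    me = owner(lo)
    return sorted({idx[i] for i in range(lo, hi) if owner(idx[i]) != me})

def executor(a, b_local, idx, lo, hi, schedule, fetch):
    for j in schedule:              # 1. exchange array elements
        b_local[j] = fetch(j)
    for i in range(lo, hi):         # 2. execute the loop body in parallel
        a[i] = b_local[idx[i]]      #    (sequential stand-in here)

b_global = [10, 11, 12, 13, 14, 15, 16, 17]
idx = [0, 5, 2, 7, 1, 3, 6, 4]
a = [0] * N
b_local = b_global[:4] + [None] * 4          # node 0 owns b[0..3]
sched = inspector(idx, 0, 4)
executor(a, b_local, idx, 0, 4, sched, lambda j: b_global[j])
print(sched, a[:4])                 # [5, 7] [10, 15, 12, 17]
```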

9

Our Improved Inspector-Executor Method

The inspector produces statically optimized code for the executor.
– The inspector runs off-line.
– Running the inspector is part of the compilation process.

The inspector hands the executor optimized code, not data!

10

Static Code Optimization

The inspector performs constant folding when generating the executor code.

Constant folding eliminates from the executor:
– the table containing the result of the inspector's analysis, which saves memory space (the table is big!)
– the memory accesses for table lookups, which gives better performance
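A toy illustration of the idea (the generator and its output format are assumptions, not the compiler's actual code): instead of the executor indexing a schedule table at run time, the inspector emits executor source with each analysed index folded in as a literal.

```python
# The inspector "compiles" the index array away: the generated executor
# contains only constant subscripts, so no idx[] table or lookup remains.

def generate_executor(idx, lo, hi):
    lines = ["def executor(a, b):"]
    for i in range(lo, hi):
        lines.append(f"    a[{i}] = b[{idx[i]}]   # index folded to a constant")
    return "\n".join(lines)

idx = [3, 0, 2, 1]
src = generate_executor(idx, 0, 4)
ns = {}
exec(src, ns)                       # build the specialised executor
a = [0] * 4
ns["executor"](a, [10, 11, 12, 13])
print(a)                            # [13, 10, 12, 11]
```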

11

OUTER directive

Specifies the range of analysis for the inspector.

– The OUTER loop repeats millions of times during the simulation.
– The INNER loop is the part that is parallelized; it becomes the executor.
– We assume that this structure fits the structure of typical simulation programs.
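The structure can be sketched as follows (the helper names are hypothetical): the inspector's analysis runs once, on the first OUTER iteration, and its result is reused by the executor for all remaining iterations.

```python
# OUTER loop: repeats many times; the analysis (inspector) runs only on
# the first iteration, and the parallelised INNER loop (executor) reuses
# the resulting schedule every time after that.

def run_simulation(steps, idx, analyse, execute):
    schedule = None
    for t in range(steps):          # OUTER loop
        if schedule is None:
            schedule = analyse(idx) # inspector: first iteration only
        execute(schedule)           # INNER loop / executor
    return schedule

calls = {"analyse": 0, "execute": 0}
def analyse(idx):
    calls["analyse"] += 1
    return sorted(set(idx))
def execute(schedule):
    calls["execute"] += 1

run_simulation(5, [3, 1, 3], analyse, execute)
print(calls)                        # {'analyse': 1, 'execute': 5}
```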

12

Restrictions

Programmers must guarantee that:

– every iteration of the OUTER loop exchanges the same set of array elements among nodes, since the inspector analyzes only the first iteration;

– the set of exchanged array elements can be determined without executing inter-node communication, since the inspector does not perform communication (to keep compilation time down).

As a consequence, our compiler cannot compile IS of the NAS parallel benchmarks.

13

Our Compiler Runs on a PC Cluster

For executing the inspector in parallel:

– The inspector must analyze a large amount of data.

– In the original inspector-executor method, the inspector runs in parallel; since our inspector is part of the compiler, the compiler itself runs on a PC cluster.

14

Execution Flow of Our Compiler

Source Program
→ Translate into SPMD
→ on each node, in parallel: Generate Inspector → Inspector Log → Analysis → Code Generation
   (the nodes exchange information about messages during analysis)
→ SPMD Parallel Code

15

Our Prototype Compiler

Input: Fortran77 + HPF + the OUTER directive
– Output: SPMD Fortran code

Target machines
– Compilation: Pentium III 733 MHz × 16 nodes, RedHat 7.1, 100Base Ethernet
– Execution: Hitachi SR2201, PowerPC-based, 150 MHz × 16 nodes

16

Experiment: pde1 Benchmark

Solves a Poisson equation; good for massively parallel computing.
– regular array accesses
– high scalability
– Distributed array accesses are concentrated in a small region of the source code.

17

Execution Time (pde1)

[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup. Annotated times: Ours 249 sec, Hitachi HPF 137,100 sec.]

Hitachi's HPF compiler needs more directives for better performance.

18

Effects of static code optimization (pde1)

[Chart: reduction of execution time (0–100%) vs. number of nodes (1, 2, 4, 8, 16), comparing dynamic and static code.]

19

Compilation Time (pde1)

[Chart: compilation time (0–250 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend Fortran, sequential, parallel, and data exchange phases.]

The long compilation time pays off if the OUTER loop iterates many times.

20

Experiment: FT-a

3D Fourier transformation. Features:
– irregular array accesses
– Distributed array accesses are concentrated in a small region of the source code.

21

Execution Time (FT-a)

[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup. Annotated times: Ours 46 sec, Hitachi HPF 4,898 sec.]

22

Compilation Time (FT-a)

[Chart: compilation time (0–350 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data exchange phases.]

23

Experiment: BT-a

Block tri-diagonal solver. Features:
– a small number of irregular array accesses
– Distributed array accesses are scattered all over the source code.

24

Execution Time (BT-a)

[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup. Annotated times: Ours 1,430 sec, Hitachi HPF 1,370,000 sec.]

25

Compilation Time (BT-a)

[Chart: compilation time (0–40,000 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data exchange phases.]

The inspector must analyze a huge number of array accesses, so our compiler cannot achieve good performance here.

26

Conclusion

An HPF compiler that utilizes hardware for inter-node communication
– the inspector-executor method
– static code optimization: the inspector produces optimized executor code
– the compiler runs on a PC cluster

Experiments
– The long compilation time is acceptable for simulation programs that run for a long time.

27

Backup Slides

28

Reducing Communication Volume (Optimization)

Loop iterations are distributed so that the communication volume is minimized.

– The data distribution is specified with HPF.

– A preliminary run measures the communication volume each assignment would cause.

[Figure: loop iterations assigned to processors PE 1 and PE 2, with the required communication volume for each candidate assignment.]

29

Merging Multiple Messages

Our compiler collects several messages into a single message.

– Messages in a loop with the INDEPENDENT directive can be merged; this directive specifies that the result of the loop is independent of the execution order of its iterations.

– Our compiler finds block-stride communication by pattern matching, to reduce the number of communication operations.

30

Future Work

Further reduce the number of communication operations.
– We want to use block-stride communication more aggressively: even if some redundant data is sent, messages could be merged into a smaller number of communications.

Prevent generated code from growing too long.
– If the data dependencies between processors are too complex, our compiler generates too many communication operations.

Improve the scalability of compilation time.
– The inspector log for BT was too huge.

Experiments with real simulations.

31

CP-PACS/Pilot-3

Distributed-memory machines
– Center for Computational Physics, University of Tsukuba
– 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
– hyper-crossbar network
– RDMA

32

Our Optimizer to Solve the Problem

Use of special communication devices
– Parallel machines sometimes have special hardware that reduces the time for inter-node communication.

Development of compilers for easy and well-known computer languages
– Fortran77, simple HPF (High Performance Fortran)

Runtime analysis
– a profiler for communication on a PC cluster

33

Effects of static code optimization (pde1)

[Chart: reduction of execution time (0–100%) vs. number of nodes (2, 4, 8, 16), comparing dynamic and static code.]
