Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Page 1: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Antonio Javier Cuenca Muñoz

Dpto. Ingeniería y Tecnología de Computadores

Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Javier Cuenca

Luis Pedro García

Domingo Giménez
Scientific Computation Research Group, University of Murcia, Spain

Jack Dongarra
Innovative Computing Laboratory, University of Tennessee, USA

Page 2: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Introduction

Automatically Optimised Linear Algebra Software

Objective
Software capable of tuning itself according to the execution environment

Motivation
Non-expert users have to take decisions about the computation
Software should adapt to the continuous evolution of the hardware
Developing efficient code by hand consumes a large quantity of resources
System computation capabilities are highly variable

Some examples of auto-tuning software: ATLAS, LFC, FFTW, I-LIB, FIBER, mpC, BeBOP, FLAME, ...

Page 3: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Automatic Optimisation on Heterogeneous Parallel Systems

Two possibilities on heterogeneous systems:

HoHe: heterogeneous algorithms (heterogeneous distribution of data)

HeHo: homogeneous algorithms and heterogeneous assignation of processes: a variable number of processes is assigned to each processor, depending on the relative speeds

A mapping of processes to processors must be made, without spending a large execution time in taking the decision

Theoretical models: parameters which represent the characteristics of the system

The general assignation problem is NP-hard, so heuristic approximations are used

Page 4: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our previous HoHo methodology

Routine model:

n: problem size

SP: system parameters
Computation and communication characteristics of the system

AP: algorithm parameters
Block size, number of processors to use, logical configurations of the processors, ... (with one process per processor)
Values are chosen when the routine begins to run

T_EXEC = f(n, SP, AP)
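To make the run-time use of this model concrete, the following Python sketch (the function names and the deliberately simple stand-in model are assumptions of this sketch, not the authors' code) enumerates candidate AP values and keeps the one with the lowest modelled time:

def choose_ap(n, sp, candidate_aps, model):
    # Evaluate the execution-time model T_EXEC = f(n, SP, AP) for every
    # candidate AP set and keep the cheapest one.
    return min(candidate_aps, key=lambda ap: model(n, sp, ap))

# Stand-in model with only an O(n^3) arithmetic term, for illustration.
toy_model = lambda n, sp, ap: sp["k3"] * 2 * n**3 / (3 * ap["p"])

best = choose_ap(2048, {"k3": 1e-9}, [{"p": 2}, {"p": 4}, {"p": 8}], toy_model)
print(best)   # {'p': 8} under this toy model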

Page 5: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our previous HoHo methodology → Our HeHo methodology

Modifications in the routine model:

New AP:
Number of processes to generate
Mapping of processes to processors

SP values change:
With more than one process per processor, each SP_i in processor i becomes d_i (the number of processes assigned to processor i) times higher

Implicit synchronization: the global value of each SP_i is taken as the maximum value over all the processors
The slowest process forces the other ones to reduce their speed, waiting for it at the different synchronization points of the routine
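A minimal sketch of this SP adjustment, assuming the SP values are stored per processor (the helper name and the example numbers are illustrative, not from the slides):

def effective_sp(sp_per_processor, d):
    # With d[i] processes on processor i, each of its system parameters becomes
    # d[i] times higher; the implicit synchronizations make the global value the
    # maximum over all processors that actually receive processes.
    scaled = [sp_i * d_i for sp_i, d_i in zip(sp_per_processor, d) if d_i > 0]
    return max(scaled)

# Example: per-operation k3 cost on three processors, mapping d = (2, 1, 1)
print(effective_sp([1.0e-9, 1.5e-9, 3.0e-9], [2, 1, 1]))   # 3e-09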

Page 6: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology: an example of routine model

LU factorisation, parallel version. Model:

SP: system parameters
k3_DGEMM, k3_DTRSM, k2_DGETF2
ts, tw

AP: algorithm parameters
b: block size
P: number of processors
p: number of processes
Mapping of the p processes on the P processors
p = r x c: logical configuration of the processes (2D mesh)

T_EXEC = T_ARI + T_COM

T_ARI: arithmetic cost, dominated by the term (2 n^3 / 3p) k3_DGEMM, plus lower-order terms in k3_DTRSM and k2_DGETF2 that depend on n, b and the r x c grid

T_COM: communication cost, with a start-up term in ts proportional to (n / b) d and a word-sending term in tw proportional to n^2 d, where d depends on the logical topology
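A rough sketch of evaluating such a model for one AP choice follows. It keeps only the dominant DGEMM term of T_ARI and an approximate form of T_COM, since the remaining terms cannot be recovered exactly from the slide, so it should be read as an illustration rather than the authors' exact formula:

def t_exec_lu(n, b, p, d, k3_dgemm, ts, tw):
    # Dominant arithmetic term of blocked parallel LU: 2n^3/3 flops over p processes.
    t_ari = k3_dgemm * 2 * n**3 / (3 * p)
    # Approximate communication cost: n/b block steps paying a start-up ts, plus a
    # word-sending term in tw; d stands for the topology-dependent factor of the r x c mesh.
    t_com = (n / b) * d * ts + n**2 * d * tw
    return t_ari + t_com

# Example: n = 2048, block size 64, 8 processes on a 2 x 4 mesh (d = 2 assumed)
print(t_exec_lu(2048, 64, 8, 2, 1e-9, 1e-4, 1e-7))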

Page 7: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology: an example of routine model

Platforms:

SUNEt:
Five SUN Ultra 1
One SUN Ultra 5
Interconnection network: Ethernet

TORC (Innovative Computing Laboratory): 21 nodes of different types
Dual- and single-processor Pentium II, Pentium III, Pentium 4 and AMD Athlon
Interconnection networks: FastEthernet, Giganet, Myrinet

Page 8: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology: an example of routine model

Theoretical vs. experimental time on SUNEt, n = 2048
Mapping of 8 processes on the 6 processors

AP      Processes mapping   Logical topology   Block size
AP 1    (1,1,1,1,1,3)       2 x 4              32
AP 2    (2,1,1,1,1,2)       2 x 4              32
AP 3    (2,2,1,1,1,1)       2 x 4              32
AP 4    (1,1,1,1,1,3)       2 x 4              64
AP 5    (2,1,1,1,1,2)       2 x 4              64
AP 6    (2,2,1,1,1,1)       2 x 4              64
AP 7    (1,1,1,1,1,3)       1 x 8              32
AP 8    (2,1,1,1,1,2)       1 x 8              32
AP 9    (2,2,1,1,1,1)       1 x 8              32
AP 10   (1,1,1,1,1,3)       1 x 8              64
AP 11   (2,1,1,1,1,2)       1 x 8              64
AP 12   (2,2,1,1,1,1)       1 x 8              64

[Bar chart: theoretical vs. experimental execution time for AP1 to AP12]

Page 9: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology: an example of routine model

Theoretical vs. experimental time on TORC, n = 4096
Mapping of 8 processes on the 19 processors

AP     Processes mapping                           Logical topology   Block size
AP 1   (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)     4 x 2              32
AP 2   (1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0)     8 x 1              32
AP 3   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)     4 x 2              32
AP 4   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,2,0)     8 x 1              32
AP 5   (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)     4 x 2              32
AP 6   (1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1)     8 x 1              32
AP 7   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)     4 x 2              32
AP 8   (1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,2,0,0)     8 x 1              32

[Bar chart: theoretical vs. experimental execution time for the AP configurations]

Page 10: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Our approach: assignment tree

A limit on the height of the tree (the number of processes) is necessary
Each node represents a possible solution: an assignment of processes to processors
The other APs (block size, logical topology) are chosen at each node (the node expansion is sketched below the figure)

[Figure: assignment tree over the P processors, one level per generated process, down to p processes]
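The node expansion can be sketched as follows, under a hypothetical representation in which a node is the multiset of processor indices that already hold one process each (this is the combinatorial tree with repetitions used by Method 4 later on):

def children(node, P):
    # Add one more process on any processor whose index is not smaller than the
    # last index already used, so that each multiset of processors appears only once.
    start = node[-1] if node else 0
    return [node + (i,) for i in range(start, P)]

print(children((), 3))       # [(0,), (1,), (2,)] : placing the first process
print(children((0, 2), 3))   # [(0, 2, 2)] : third process, only processor 2 allowed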

Page 11: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

For each node:
EET(node): Estimated Execution Time
LET(node): Lowest Execution Time
GET(node): Greatest Execution Time

Optimization problem: finding the node with the lowest EET

LET and GET are lower and upper bounds on the optimum solution of the subtree below the node, and are used to limit the number of nodes evaluated:
MEET = min over the evaluated nodes of GET(node)
If LET(node) > MEET, do not work below this node
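A sketch of this pruned search, assuming problem-specific callables for the children and for EET, LET and GET are available (an illustration of the pruning rule, not the authors' implementation):

def pruned_search(root, children, eet, let, get, max_level):
    best_node, best_eet = None, float("inf")
    meet = float("inf")                  # MEET = min over evaluated nodes of GET(node)
    stack = [root]
    while stack:
        node = stack.pop()
        if let(node) > meet:             # lower bound already worse than MEET: prune
            continue
        meet = min(meet, get(node))
        if eet(node) < best_eet:         # keep the best estimated execution time seen
            best_node, best_eet = node, eet(node)
        if len(node) < max_level:        # height limit = maximum number of processes
            stack.extend(children(node))
    return best_node, best_eet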

Page 12: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Automatic searching strategies in the assignment tree:

Method 1: backtracking; GET = EET
Method 2: backtracking; GET obtained with a greedy approach
Method 3: backtracking; GET obtained with a greedy approach; LET obtained with a greedy approach
Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions

Page 13: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Automatic searching strategies in the assignment tree. Method 1: backtracking

GET = EET

LET = LETari + LETcom
LETari = sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom = obtained assuming the best logical topology of processes that can be reached from this node

Page 14: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Automatic searching strategies in the assignment tree. Method 2: backtracking

GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution

LET = LETari + LETcom
LETari = sequential time divided by the maximum achievable speed-up when using all the processors not yet discarded
LETcom = obtained assuming the best logical topology of processes that can be reached from this node

Fewer nodes are analysed, but the cost of evaluating each node increases

Page 15: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Automatic searching strategies in the assignment tree. Method 3: backtracking

GET obtained with a greedy approach: the EET of each child of the node is calculated, and the child with the lowest EET is included in the solution

LET = LETari + LETcom
LETari = obtained with a greedy approach: for each node, the child that least increases the cost of the arithmetic operations is included in the solution, to obtain the lower bound
LETcom = obtained assuming the best logical topology of processes that can be reached from this node

It is possible that a branch leading to an optimal solution will be discarded

Page 16: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Our HeHo methodology

Automatic searching strategies in the assignment tree:

Method 4: greedy method on the current assignment tree (a combinatorial tree with repetitions)
Method 5: greedy method on a permutational tree with repetitions

In both methods 4 and 5, to obtain better logical topologies of the processes, the traversal continues (through the best child of each node) until the established maximum level is reached
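A sketch of this greedy descent, again assuming callables for the children and for EET (the stopping rule follows the slide: keep descending through the best child until the maximum level is reached):

def greedy_descent(root, children, eet, max_level):
    node, best_node, best_eet = root, None, float("inf")
    while len(node) < max_level:
        kids = children(node)
        if not kids:
            break
        node = min(kids, key=eet)        # follow only the child with the lowest EET
        if eet(node) < best_eet:
            best_node, best_eet = node, eet(node)
    return best_node, best_eet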

Page 17: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Experimental Results

Human searching strategies in the assignment tree:

Greedy User (GU): uses ALL the available processors, one process per processor
Conservative User (CU): uses HALF of the available processors, one process per processor
Expert User (EU): uses 1 processor, HALF of the processors or ALL the processors depending on the problem size, one process per processor

Page 18: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Experimental Results

Automatic decisions vs. users, on SUNEt (n = 7680)

Method   Processes mapping   b     Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1)       64    2 x 3              718.94     0.02     25
2        (1,1,1,1,1,1)       64    2 x 3              718.94     0.04     25
3        (1,1,1,1,1,1)       64    2 x 3              718.94     0.02     25
4        (1,1,0,0,0,1)       128   1 x 3              887.85     0.0001   25
5        (1,1,0,0,0,1)       128   1 x 3              887.85     0.0005   25
CU       (1,1,0,0,0,1)       128   1 x 3              1047.13
GU       (1,1,1,1,1,1)       64    2 x 3              887.85
EU       (1,1,1,1,1,1)       64    2 x 3              887.85

Page 19: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Experimental Results

Automatic decisions vs. users, on TORC (n = 2048)

Method   Processes mapping                           b    Logical topology   Solution   t.t.t.   Level
1        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)     64   3 x 5              17.91      3.08     15
2        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)     64   3 x 5              17.91      3.08     15
3        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)     64   4 x 4              15.27      0.06     25
4        (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0)     64   1 x 1              43.16      0.0012   30
5        (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0)     64   4 x 4              15.27      0.01     30
CU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)     64   3 x 3              23.77
GU       (1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)     32   1 x 19             33.57
EU       (1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1)     64   3 x 3              23.77

Page 20: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Simulations

Virtual platforms: variations and/or increases of the real platforms

mTORC-01: the quantity of 17P4 is increased to 11. Number of processors: 29. Types of processors: 4

mTORC-02: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20, respectively. Number of processors: 50. Types of processors: 4

mTORC-03: the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively, and additional processors have been included. Number of processors: 100. Types of processors: 10

Page 21: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Simulations

Automatic decisions vs. users, on virtual platform mTORC-01 (n = 20000)
(the quantity of 17P4 is increased to 11; number of processors: 29; types of processors: 4)

           Met. 1   Met. 2   Met. 3   Met. 4   Met. 5   CU        GU        EU
Solution   666.44   818.82   666.44   666.44   666.44   1322.23   1145.09   1145.09
t.t.t.     20.39    59.45    0.68     0.0007   0.0122
Level      15       15       20       25       25

Page 22: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Experimental Results

Automatic decisions vs. users, on virtual platform mTORC-02 (n = 20000)
(the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 10, 10 and 20, respectively; number of processors: 50; types of processors: 4)

           Met. 1    Met. 2    Met. 3    Met. 4    Met. 5    CU        GU        EU
Solution   3721.98   3791.98   2439.43   1958.43   1500.24   2249.70   2748.36   2249.70
t.t.t.     259.44    792.32    7.46      0.01      0.07
Level      15        15        25        30        30

Page 23: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Experimental Results

Automatic decisions vs. users, on virtual platform mTORC-03 (n = 20000)
(the quantities of DPIII, SPIII, Ath and 17P4 are increased to 10, 15, 5 and 10, respectively, and additional processors have been included; number of processors: 100; types of processors: 10)

           Met. 1     Met. 2     Met. 3     Met. 4     Met. 5    CU        GU        EU
Solution   10712.55   14532.45   10712.55   10712.55   4333.23   7405.34   5422.87   5422.87
t.t.t.     109.24     169.72     1274.34    0.08       2.34
Level      10         10         5          25         40

Page 24: Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters


Conclusions

Extension of our previous self-optimisation methodology for homogeneous systems

On heterogeneous systems, new decisions: the number of processes and the mapping of processes to processors

Good results with the parallel LU factorisation

The same methodology could be applied to other linear algebra routines