Parallel Programming in C with MPI and OpenMP


Page 1: Parallel Programming in C with MPI and OpenMP

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn

Page 2: Parallel Programming in C with MPI and OpenMP


Chapter 11

Matrix Multiplication

Page 3: Parallel Programming in C with MPI and OpenMP


Outline

Sequential algorithms
  Iterative, row-oriented
  Recursive, block-oriented

Parallel algorithms
  Rowwise block striped decomposition
  Cannon’s algorithm

Page 4: Parallel Programming in C with MPI and OpenMP


Iterative, Row-oriented Algorithm

Series of inner product (dot product) operations

Page 5: Parallel Programming in C with MPI and OpenMP


Performance as n Increases

Page 6: Parallel Programming in C with MPI and OpenMP


Reason: Matrix B Gets Too Big for Cache

Computing a row of C requires accessing every element of B

Page 7: Parallel Programming in C with MPI and OpenMP


Block Matrix Multiplication

Replace scalar multiplication with matrix multiplication

Replace scalar addition with matrix addition

Page 8: Parallel Programming in C with MPI and OpenMP


Recurse Until B Small Enough

Page 9: Parallel Programming in C with MPI and OpenMP


Comparing Sequential Performance

Page 10: Parallel Programming in C with MPI and OpenMP


First Parallel Algorithm

Partitioning
  Divide matrices into rows
  Each primitive task has corresponding rows of three matrices

Communication
  Each task must eventually see every row of B
  Organize tasks into a ring

Page 11: Parallel Programming in C with MPI and OpenMP


First Parallel Algorithm (cont.)

Agglomeration and mapping
  Fixed number of tasks, each requiring same amount of computation
  Regular communication among tasks
  Strategy: assign each process a contiguous group of rows

Page 12: Parallel Programming in C with MPI and OpenMP


Communication of B

[Figure: four processes in a ring, each holding its strips of A, B, and C; the strips of B are passed around the ring]


Page 16: Parallel Programming in C with MPI and OpenMP


Complexity Analysis

Algorithm has p iterations

During each iteration a process multiplies an (n/p) × (n/p) block of A by an (n/p) × n block of B: Θ(n³/p²)

Total computation time: Θ(n³/p)

Each process ends up passing (p - 1)n²/p = Θ(n²) elements of B

Page 17: Parallel Programming in C with MPI and OpenMP


Isoefficiency Analysis

Sequential algorithm: Θ(n³)

Parallel overhead: Θ(pn²)

Isoefficiency relation: n³ ≥ Cpn², so n ≥ Cp

Scalability function: M(n) = n², giving M(Cp)/p = C²p²/p = C²p

This system does not have good scalability

Page 18: Parallel Programming in C with MPI and OpenMP


Weakness of Algorithm 1

Blocks of B being manipulated have p times more columns than rows

Each process must access every element of matrix B

Ratio of computations per communication is poor: only 2n/p

Page 19: Parallel Programming in C with MPI and OpenMP


Parallel Algorithm 2 (Cannon’s Algorithm)

Associate a primitive task with each matrix element

Agglomerate tasks responsible for a square (or nearly square) block of C

Computation-to-communication ratio rises to n/√p

Page 20: Parallel Programming in C with MPI and OpenMP


Elements of A and B Needed to Compute a Process’s Portion of C

Algorithm 1

Cannon’s Algorithm

Page 21: Parallel Programming in C with MPI and OpenMP


Blocks Must Be Aligned

[Figure: block positions before (left) and after (right) alignment]

Page 22: Parallel Programming in C with MPI and OpenMP


Blocks Need to Be Aligned

[Figure: 4 × 4 grid of processes; the process in row i, column j holds blocks Aij and Bij, drawn as a pair of triangles]

Each triangle represents a matrix block

Only same-color triangles should be multiplied

Page 23: Parallel Programming in C with MPI and OpenMP


Rearrange Blocks

[Figure: the 4 × 4 grid of blocks Aij and Bij with the alignment shifts indicated]

Block Aij cycles left i positions

Block Bij cycles up j positions

Page 24: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 1: row 1 of A blocks as seen from P1,2 (A10 A11 A12 A13) and column 2 of B blocks (B02 B12 B22 B32)]

Page 25: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 2: after one shift, the A blocks read A11 A12 A13 A10 and the B blocks read B12 B22 B32 B02]

Page 26: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 3: after two shifts, the A blocks read A12 A13 A10 A11 and the B blocks read B22 B32 B02 B12]

Page 27: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 4: after three shifts, the A blocks read A13 A10 A11 A12 and the B blocks read B32 B02 B12 B22]

Page 28: Parallel Programming in C with MPI and OpenMP


Complexity Analysis

Algorithm has √p iterations

During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2))

Computational complexity: Θ(n³/p)

During each iteration a process sends and receives two blocks of size (n/√p) × (n/√p)

Communication complexity: Θ(n²/√p)

Page 29: Parallel Programming in C with MPI and OpenMP


Isoefficiency Analysis

Sequential algorithm: Θ(n³)

Parallel overhead: Θ(√p · n²)

Isoefficiency relation: n³ ≥ C√p n², so n ≥ C√p

Scalability function: M(n) = n², giving M(C√p)/p = C²p/p = C²

This system is highly scalable

Page 30: Parallel Programming in C with MPI and OpenMP


Summary

Considered two sequential algorithms
  Iterative, row-oriented algorithm
  Recursive, block-oriented algorithm
  Second has better cache hit rate as n increases

Developed two parallel algorithms
  First based on rowwise block striped decomposition
  Second based on checkerboard block decomposition
  Second algorithm is scalable, while the first is not