Parallel Programming in C with MPI and OpenMP


Page 1: Parallel Programming in C with MPI and OpenMP

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn

Page 2: Parallel Programming in C with MPI and OpenMP


Chapter 11

Matrix Multiplication

Page 3: Parallel Programming in C with MPI and OpenMP


Outline

Sequential algorithms
  Iterative, row-oriented
  Recursive, block-oriented

Parallel algorithms
  Rowwise block striped decomposition
  Cannon’s algorithm

Page 4: Parallel Programming in C with MPI and OpenMP


Iterative, Row-oriented Algorithm

Series of inner product (dot product) operations

Page 5: Parallel Programming in C with MPI and OpenMP


Performance as n Increases

Page 6: Parallel Programming in C with MPI and OpenMP


Reason: Matrix B Gets Too Big for Cache

Computing a row of C requires accessing every element of B

Page 7: Parallel Programming in C with MPI and OpenMP


Block Matrix Multiplication

Replace scalar multiplication with matrix multiplication

Replace scalar addition with matrix addition

Page 8: Parallel Programming in C with MPI and OpenMP


Recurse Until B Small Enough

Page 9: Parallel Programming in C with MPI and OpenMP


Comparing Sequential Performance

Page 10: Parallel Programming in C with MPI and OpenMP


First Parallel Algorithm

Partitioning
  Divide matrices into rows
  Each primitive task has corresponding rows of three matrices

Communication
  Each task must eventually see every row of B
  Organize tasks into a ring

Page 11: Parallel Programming in C with MPI and OpenMP


First Parallel Algorithm (cont.)

Agglomeration and mapping
  Fixed number of tasks, each requiring same amount of computation
  Regular communication among tasks
  Strategy: assign each process a contiguous group of rows

Page 12: Parallel Programming in C with MPI and OpenMP


Communication of B

[Figure: four processes in a ring, each holding its strips of A, B, and C; the strips of B are passed around the ring]


Page 16: Parallel Programming in C with MPI and OpenMP


Complexity Analysis

Algorithm has p iterations

During each iteration a process multiplies an (n/p) × (n/p) block of A by an (n/p) × n block of B: Θ(n³/p²)

Total computation time: Θ(n³/p)

Each process ends up passing (p - 1)n²/p = Θ(n²) elements of B

Page 17: Parallel Programming in C with MPI and OpenMP


Isoefficiency Analysis

Sequential algorithm: Θ(n³)

Parallel overhead: Θ(pn²)

Isoefficiency relation: n³ ≥ Cpn², so n ≥ Cp

Scalability function: M(n) = n², giving M(Cp)/p = C²p²/p = C²p

This system does not have good scalability

Page 18: Parallel Programming in C with MPI and OpenMP


Weakness of Algorithm 1

Blocks of B being manipulated have p times more columns than rows

Each process must access every element of matrix B

Ratio of computations per communication is poor: only 2n/p

Page 19: Parallel Programming in C with MPI and OpenMP


Parallel Algorithm 2 (Cannon’s Algorithm)

Associate a primitive task with each matrix element

Agglomerate tasks responsible for a square (or nearly square) block of C

Computation-to-communication ratio rises to n/√p

Page 20: Parallel Programming in C with MPI and OpenMP


Elements of A and B Needed to Compute a Process’s Portion of C

Algorithm 1

Cannon’s Algorithm

Page 21: Parallel Programming in C with MPI and OpenMP


Blocks Must Be Aligned

[Figure: block positions before (left) and after (right) alignment]

Page 22: Parallel Programming in C with MPI and OpenMP


Blocks Need to Be Aligned

[Figure: 4 × 4 grid of processes; the process in row i, column j holds blocks Aij and Bij, drawn as a pair of triangles]

Each triangle represents a matrix block

Only same-color triangles should be multiplied

Page 23: Parallel Programming in C with MPI and OpenMP


Rearrange Blocks

[Figure: the 4 × 4 grid of blocks Aij and Bij with the alignment shifts indicated]

Block Aij cycles left i positions

Block Bij cycles up j positions

Page 24: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 1: row 1 of A blocks as seen from P1,2 (A10 A11 A12 A13) and column 2 of B blocks (B02 B12 B22 B32)]

Page 25: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 2: after one shift, the A blocks read A11 A12 A13 A10 and the B blocks read B12 B22 B32 B02]

Page 26: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 3: after two shifts, the A blocks read A12 A13 A10 A11 and the B blocks read B22 B32 B02 B12]

Page 27: Parallel Programming in C with MPI and OpenMP


Consider Process P1,2

[Figure: Step 4: after three shifts, the A blocks read A13 A10 A11 A12 and the B blocks read B32 B02 B12 B22]

Page 28: Parallel Programming in C with MPI and OpenMP


Complexity Analysis

Algorithm has √p iterations

During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2))

Computational complexity: Θ(n³/p)

During each iteration a process sends and receives two blocks of size (n/√p) × (n/√p)

Communication complexity: Θ(n²/√p)

Page 29: Parallel Programming in C with MPI and OpenMP


Isoefficiency Analysis

Sequential algorithm: Θ(n³)

Parallel overhead: Θ(√p · n²)

Isoefficiency relation: n³ ≥ C√p n², so n ≥ C√p

Scalability function: M(n) = n², giving M(C√p)/p = C²p/p = C²

This system is highly scalable

Page 30: Parallel Programming in C with MPI and OpenMP


Summary

Considered two sequential algorithms
  Iterative, row-oriented algorithm
  Recursive, block-oriented algorithm
  Second has better cache hit rate as n increases

Developed two parallel algorithms
  First based on rowwise block striped decomposition
  Second based on checkerboard block decomposition
  Second algorithm is scalable, while the first is not