Parallel Programming in C with MPI and OpenMP
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 11
Matrix Multiplication
Outline
- Sequential algorithms
  - Iterative, row-oriented
  - Recursive, block-oriented
- Parallel algorithms
  - Rowwise block striped decomposition
  - Cannon’s algorithm
Iterative, Row-oriented Algorithm
Series of inner product (dot product) operations
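The series of inner products can be sketched in C as follows. This is a minimal illustration rather than code from the slides: the function name `matmul_iterative` and the flat row-major layout are assumptions made for the example.

```c
#include <stddef.h>

/* Row-oriented iterative multiply: c = a * b for n x n matrices stored
   row-major in flat arrays (an assumed layout). Each c[i][j] is the inner
   product of row i of a and column j of b. */
void matmul_iterative(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];  /* walks down column j of b */
            c[i * n + j] = sum;
        }
    }
}
```

Note that the innermost loop strides through `b` one full row length at a time, so producing one row of `c` touches every element of `b`.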
Performance as n Increases
Reason: Matrix B Gets Too Big for Cache
Computing a row of C requires accessing every element of B
Block Matrix Multiplication
- Replace scalar multiplication with matrix multiplication
- Replace scalar addition with matrix addition
Recurse Until B Small Enough
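One possible shape for the recursion is sketched below. It is a variant that halves the largest dimension at each level instead of forming four quadrants explicitly, which lets it handle sizes that are not powers of two. The names `mm_base` and `mm_recursive` and the cutoff `BLOCK_THRESHOLD` are hypothetical, and the cutoff value is a placeholder that would need tuning to the cache size.

```c
#include <stddef.h>

#define BLOCK_THRESHOLD 16  /* assumed cutoff; tune so the blocks fit in cache */

/* Base case: c += a * b, where a is m x k, b is k x n, c is m x n.
   All three are sub-blocks of row-major matrices with row length ld. */
static void mm_base(size_t m, size_t n, size_t k, size_t ld,
                    const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * ld + j];
            for (size_t l = 0; l < k; l++)
                sum += a[i * ld + l] * b[l * ld + j];
            c[i * ld + j] = sum;
        }
}

/* Recursive multiply-accumulate: c += a * b. The caller zeroes c first.
   The largest of the three dimensions is halved until all are small. */
void mm_recursive(size_t m, size_t n, size_t k, size_t ld,
                  const double *a, const double *b, double *c)
{
    if (m <= BLOCK_THRESHOLD && n <= BLOCK_THRESHOLD && k <= BLOCK_THRESHOLD) {
        mm_base(m, n, k, ld, a, b, c);
    } else if (m >= n && m >= k) {          /* split the rows of a and c */
        size_t h = m / 2;
        mm_recursive(h, n, k, ld, a, b, c);
        mm_recursive(m - h, n, k, ld, a + h * ld, b, c + h * ld);
    } else if (n >= k) {                    /* split the columns of b and c */
        size_t h = n / 2;
        mm_recursive(m, h, k, ld, a, b, c);
        mm_recursive(m, n - h, k, ld, a, b + h, c + h);
    } else {                                /* split the inner dimension */
        size_t h = k / 2;
        mm_recursive(m, n, h, ld, a, b, c);
        mm_recursive(m, n, k - h, ld, a + h, b + h * ld, c);
    }
}
```

For square n x n matrices the call would be `mm_recursive(n, n, n, n, a, b, c)` with `c` zeroed beforehand.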
Comparing Sequential Performance
First Parallel Algorithm
- Partitioning
  - Divide matrices into rows
  - Each primitive task has corresponding rows of three matrices
- Communication
  - Each task must eventually see every row of B
  - Organize tasks into a ring
First Parallel Algorithm (cont.)
- Agglomeration and mapping
  - Fixed number of tasks, each requiring same amount of computation
  - Regular communication among tasks
  - Strategy: Assign each process a contiguous group of rows
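The data movement of this design can be illustrated with a serial simulation. The sketch below makes no MPI calls: the p processes are simulated by a loop, and the block of B that a process would receive from its ring neighbour is selected by index arithmetic instead. The name `matmul_ring_sim` is hypothetical, and the sketch assumes n is divisible by p.

```c
#include <stddef.h>

/* Serial simulation of the rowwise block striped ring algorithm.
   Computes c = a * b for n x n row-major matrices with p simulated
   processes; n must be divisible by p (assumption of this sketch).
   Process r owns rows [r*rows, (r+1)*rows) of a, b and c. */
void matmul_ring_sim(size_t n, size_t p,
                     const double *a, const double *b, double *c)
{
    size_t rows = n / p;

    for (size_t x = 0; x < n * n; x++)
        c[x] = 0.0;

    for (size_t step = 0; step < p; step++) {       /* p iterations */
        for (size_t r = 0; r < p; r++) {            /* each simulated process */
            /* Index of the block of b that process r holds at this step.
               In a message-passing version this block would have arrived
               from the neighbouring process in the ring. */
            size_t held = (r + step) % p;

            /* Multiply an (n/p) x (n/p) piece of a by an (n/p) x n block of b. */
            for (size_t i = r * rows; i < (r + 1) * rows; i++)
                for (size_t k = held * rows; k < (held + 1) * rows; k++) {
                    double aik = a[i * n + k];
                    for (size_t j = 0; j < n; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
        }
    }
}
```

After p steps every simulated process has seen every block of B exactly once, which matches the requirement that each task must eventually see every row of B.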
Communication of B
[Figure, four slides: each process is labelled with its blocks of A, B, and C]
Complexity Analysis
- Algorithm has $p$ iterations
- During each iteration a process multiplies an $(n/p) \times (n/p)$ block of A by an $(n/p) \times n$ block of B: $\Theta(n^3/p^2)$
- Total computation time: $\Theta(n^3/p)$
- Each process ends up passing $(p-1)n^2/p = \Theta(n^2)$ elements of B
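As a worked check of these bounds, the counts can be written out step by step. Work is counted here in multiply-add pairs, which is an assumption of this sketch and does not affect the asymptotic results.

```latex
\begin{align*}
\text{work per iteration} &= \frac{n}{p}\cdot\frac{n}{p}\cdot n = \frac{n^3}{p^2},\\
\text{total work per process} &= p\cdot\frac{n^3}{p^2} = \frac{n^3}{p},\\
\text{elements of } B \text{ passed} &= (p-1)\cdot\frac{n}{p}\cdot n
  = \frac{(p-1)\,n^2}{p} = \Theta(n^2).
\end{align*}
```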
Isoefficiency Analysis
- Sequential algorithm: $\Theta(n^3)$
- Parallel overhead: $\Theta(pn^2)$
- Isoefficiency relation: $n^3 \ge Cpn^2 \Rightarrow n \ge Cp$
- Scalability function: $M(Cp)/p = C^2p^2/p = C^2p$
- This system does not have good scalability
Weakness of Algorithm 1
- Blocks of B being manipulated have $p$ times more columns than rows
- Each process must access every element of matrix B
- Ratio of computations per communication is poor: only $2n/p$
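The $2n/p$ figure can be recovered from the per-iteration counts, assuming each multiply-add pair counts as two floating-point operations:

```latex
\begin{align*}
\text{operations per iteration} &= 2\cdot\frac{n}{p}\cdot\frac{n}{p}\cdot n = \frac{2n^3}{p^2},\\
\text{elements communicated per iteration} &= \frac{n}{p}\cdot n = \frac{n^2}{p},\\
\text{ratio} &= \frac{2n^3/p^2}{n^2/p} = \frac{2n}{p}.
\end{align*}
```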
Parallel Algorithm 2 (Cannon’s Algorithm)
- Associate a primitive task with each matrix element
- Agglomerate tasks responsible for a square (or nearly square) block of C
- Computation-to-communication ratio rises to $n/\sqrt{p}$
Elements of A and B Needed to Compute a Process’s Portion of C
[Figure panels: Algorithm 1; Cannon’s Algorithm]
Blocks Must Be Aligned
[Figure panels: Before; After]
Blocks Need to Be Aligned
[Figure: 4 × 4 grid, each cell labelled with one block of A and one block of B]
A00 B00 | A01 B01 | A02 B02 | A03 B03
A10 B10 | A11 B11 | A12 B12 | A13 B13
A20 B20 | A21 B21 | A22 B22 | A23 B23
A30 B30 | A31 B31 | A32 B32 | A33 B33
- Each triangle represents a matrix block
- Only same-color triangles should be multiplied
Rearrange Blocks
[Figure: the 4 × 4 grid of blocks A00, B00 through A33, B33 after rearrangement]
- Block Aij cycles left i positions
- Block Bij cycles up j positions
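The two shift rules determine which blocks each process holds once alignment is complete. The helper functions below are a small sketch of that index arithmetic on a q × q process grid; the names `aligned_a_col` and `aligned_b_row` are hypothetical.

```c
#include <stddef.h>

/* Block A(i, k) cycles left i positions, so it lands in column (k - i) mod q.
   Process (i, j) therefore ends up holding A(i, (j + i) mod q). */
size_t aligned_a_col(size_t i, size_t j, size_t q)
{
    return (j + i) % q;
}

/* Block B(k, j) cycles up j positions, so it lands in row (k - j) mod q.
   Process (i, j) therefore ends up holding B((i + j) mod q, j). */
size_t aligned_b_row(size_t i, size_t j, size_t q)
{
    return (i + j) % q;
}
```

Both functions return the same value for a given process, so the column index of its A block equals the row index of its B block, which is the condition for the two blocks to be multiplied. Under these rules process P1,2 on a 4 × 4 grid holds A13 and B32 after alignment.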
Consider Process P1,2
[Figure, four slides: the blocks of row 1 of A and column 2 of B, in the order shown at each step]
Step 1: A10 A11 A12 A13; B02 B12 B22 B32
Step 2: A11 A12 A13 A10; B12 B22 B32 B02
Step 3: A12 A13 A10 A11; B22 B32 B02 B12
Step 4: A13 A10 A11 A12; B32 B02 B12 B22
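The whole procedure can be simulated serially to check that every process accumulates the right products. The sketch below makes no MPI calls and does not move any data: on a q × q grid, the blocks a process would hold after alignment and `step` further shifts are selected by the index `(i + j + step) % q`. The names `block_mac` and `matmul_cannon_sim` are hypothetical, and n is assumed to be divisible by q.

```c
#include <stddef.h>

/* c_blk += a_blk * b_blk for m x m blocks that live inside n x n
   row-major matrices (so consecutive block rows are n elements apart). */
static void block_mac(size_t n, size_t m,
                      const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; i++)
        for (size_t k = 0; k < m; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < m; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Serial simulation of Cannon's algorithm on a q x q grid of simulated
   processes (p = q * q). Computes c = a * b; n must be divisible by q. */
void matmul_cannon_sim(size_t n, size_t q,
                       const double *a, const double *b, double *c)
{
    size_t m = n / q;                         /* block side length */

    for (size_t x = 0; x < n * n; x++)
        c[x] = 0.0;

    for (size_t step = 0; step < q; step++)   /* sqrt(p) iterations */
        for (size_t i = 0; i < q; i++)
            for (size_t j = 0; j < q; j++) {
                /* After alignment process (i, j) holds inner index (i + j) mod q.
                   Each later shift of A left and B up advances it by one. */
                size_t k = (i + j + step) % q;
                block_mac(n, m,
                          a + (i * m) * n + k * m,    /* block A(i, k) */
                          b + (k * m) * n + j * m,    /* block B(k, j) */
                          c + (i * m) * n + j * m);   /* block C(i, j) */
            }
}
```

Over the q steps the inner index k takes every value from 0 to q - 1 exactly once for each process, so each block of C receives all q of its partial products.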
Complexity Analysis
- Algorithm has $\sqrt{p}$ iterations
- During each iteration process multiplies two $(n/\sqrt{p}) \times (n/\sqrt{p})$ matrices: $\Theta(n^3/p^{3/2})$
- Computational complexity: $\Theta(n^3/p)$
- During each iteration process sends and receives two blocks of size $(n/\sqrt{p}) \times (n/\sqrt{p})$
- Communication complexity: $\Theta(n^2/\sqrt{p})$
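Written out, the totals follow from multiplying the per-iteration costs by the number of iterations. Work is again counted in multiply-add pairs, an assumption of this sketch.

```latex
\begin{align*}
\text{work per iteration} &= \left(\frac{n}{\sqrt{p}}\right)^{3} = \frac{n^3}{p^{3/2}},\\
\text{total work per process} &= \sqrt{p}\cdot\frac{n^3}{p^{3/2}} = \frac{n^3}{p},\\
\text{elements sent per iteration} &= 2\left(\frac{n}{\sqrt{p}}\right)^{2} = \frac{2n^2}{p},\\
\text{total elements sent} &= \sqrt{p}\cdot\frac{2n^2}{p} = \frac{2n^2}{\sqrt{p}}
  = \Theta\!\left(\frac{n^2}{\sqrt{p}}\right).
\end{align*}
```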
Isoefficiency Analysis
- Sequential algorithm: $\Theta(n^3)$
- Parallel overhead: $\Theta(\sqrt{p}\,n^2)$
- Isoefficiency relation: $n^3 \ge C\sqrt{p}\,n^2 \Rightarrow n \ge C\sqrt{p}$
- Scalability function: $M(C\sqrt{p})/p = C^2p/p = C^2$
- This system is highly scalable
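The difference between the two scalability functions is easier to see with numbers. The sketch below evaluates the smallest n allowed by each isoefficiency relation and the resulting memory per process, taking $M(n) = n^2$ as the slides' formulas imply. The function names and the constant C are hypothetical, and Cannon's algorithm is parameterised by the grid side q (so p = q * q) to avoid a square root.

```c
/* Smallest n satisfying n >= C * p (rowwise block striped algorithm). */
double min_n_rowwise(double c, double p)
{
    return c * p;
}

/* Smallest n satisfying n >= C * sqrt(p), with q = sqrt(p) the grid side. */
double min_n_cannon(double c, double q)
{
    return c * q;
}

/* Memory per process, M(n) / p, with M(n) = n * n matrix elements. */
double memory_per_process(double n, double p)
{
    return n * n / p;
}
```

With C = 2, going from p = 16 to p = 64 raises the per-process memory of the first algorithm from 64 to 256 elements, while for Cannon's algorithm it stays at 4.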
Summary
- Considered two sequential algorithms
  - Iterative, row-oriented algorithm
  - Recursive, block-oriented algorithm
  - Second has better cache hit rate as n increases
- Developed two parallel algorithms
  - First based on rowwise block striped decomposition
  - Second based on checkerboard block decomposition
  - Second algorithm is scalable, while first is not