Sahalu Junaidu ICS 573: High Performance Computing 8.1
Topic Overview
• Matrix-Matrix Multiplication
• Block Matrix Operations
• A Simple Parallel Matrix-Matrix Multiplication
• Cannon's Algorithm
• Overlapping Communication with Computation
Matrix-Matrix Multiplication
• Building on our matrix-vector multiplication (Quinn’s Chapter 8), we now consider matrix-matrix multiplication
– multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
• For simplicity, we use the following serial algorithm:
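A minimal C sketch of the conventional triple-loop serial algorithm, assumed here to be the one this slide refers to. The function name mat_mult and the row-major storage layout are illustrative choices, not taken from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two dense n x n matrices stored in row-major order: c = a * b.
 * Performs n^3 multiplications and n^3 additions. */
void mat_mult(size_t n, const double *a, const double *b, double *c)
{
    assert(a != c && b != c);   /* the output must not alias an input */

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Calling mat_mult(2, a, b, c) with a and b holding two 2 x 2 matrices fills c with their product.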
Block Matrix Operations
• Matrix computations involving scalar algebraic operations on the matrix elements can be expressed in terms of identical operations on submatrices of the original matrix.
• Such algebraic operations on the submatrices are called block matrix operations.
– useful in matrix multiplication as well as in a variety of other matrix algorithms
• In this view, an n x n matrix A can be regarded as a q x q array of blocks $A_{i,j}$ ($0 \le i, j < q$) such that each block is an (n/q) x (n/q) submatrix.
• We perform $q^3$ matrix multiplications, each involving (n/q) x (n/q) matrices.
– each requiring $(n/q)^3$ additions and multiplications
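The block view can be sketched in C as follows. This is an illustrative version, assuming row-major storage and that q divides n; the name block_mat_mult is hypothetical. The three outer loops enumerate the q^3 block products and the three inner loops carry out one (n/q) x (n/q) block multiply-accumulate.

```c
#include <assert.h>
#include <stddef.h>

/* Block matrix multiplication: c = a * b for row-major n x n matrices,
 * viewed as a q x q array of (n/q) x (n/q) blocks. */
void block_mat_mult(size_t n, size_t q,
                    const double *a, const double *b, double *c)
{
    assert(q > 0 && n % q == 0);   /* assumption: q divides n */
    size_t bs = n / q;             /* block size */

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t bi = 0; bi < q; bi++) {
        for (size_t bj = 0; bj < q; bj++) {
            for (size_t bk = 0; bk < q; bk++) {
                /* C[bi][bj] += A[bi][bk] * B[bk][bj] */
                for (size_t i = bi * bs; i < (bi + 1) * bs; i++) {
                    for (size_t j = bj * bs; j < (bj + 1) * bs; j++) {
                        double sum = 0.0;
                        for (size_t k = bk * bs; k < (bk + 1) * bs; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] += sum;
                    }
                }
            }
        }
    }
}
```

With q = 1 this degenerates to the plain triple loop, and with q = n every block is a single element; in each case the result is the same product matrix.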
A Simple Parallel Matrix-Matrix Multiplication Algorithm
• Consider two n x n matrices A and B partitioned into p blocks $A_{i,j}$ and $B_{i,j}$ ($0 \le i, j < \sqrt{p}$) of size $(n/\sqrt{p}) \times (n/\sqrt{p})$ each.
• Process $P_{i,j}$ initially stores $A_{i,j}$ and $B_{i,j}$ and computes block $C_{i,j}$ of the result matrix.
• Computing submatrix $C_{i,j}$ requires all submatrices $A_{i,k}$ and $B_{k,j}$ for $0 \le k < \sqrt{p}$.
• All-to-all broadcast blocks of A along rows and B along columns.
• Perform local submatrix multiplication.
Matrix-Matrix Multiplication: Performance Analysis
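• The two broadcasts take time $2\left(t_s \log\sqrt{p} + t_w \frac{n^2}{p}(\sqrt{p} - 1)\right)$, where $t_s$ is the message startup time and $t_w$ is the per-word transfer time.
• The computation requires $\sqrt{p}$ multiplications of $(n/\sqrt{p}) \times (n/\sqrt{p})$ sized submatrices.
• The parallel run time is approximately $T_P = \frac{n^3}{p} + t_s \log p + 2 t_w \frac{n^2}{\sqrt{p}}$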
Drawback of the Simple Parallel Algorithm
• A major drawback of this algorithm is that it is not memory optimal
• Each process has $\sqrt{p}$ blocks of both matrices A and B at the end of each communication phase
• Thus, each process requires $\Theta(n^2/\sqrt{p})$ memory
– since each block requires $\Theta(n^2/p)$ memory
• The total memory requirement over all the processes is $\Theta(n^2 \times \sqrt{p})$, i.e., $\sqrt{p}$ times the memory requirement of the sequential algorithm.
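A small worked instance of this memory count may help. The values n = 1024 and p = 16 are illustrative assumptions, not taken from the slides, and memory is counted in matrix elements for A and B only.

```latex
\begin{align*}
\text{elements per block} &= \frac{n^2}{p} = \frac{1024^2}{16} = 65\,536 \\
\text{elements per process (A and B)} &= 2\sqrt{p}\,\frac{n^2}{p} = 2 \cdot 4 \cdot 65\,536 = 524\,288 \\
\text{total over all processes} &= p \cdot 524\,288 = 8\,388\,608 = \sqrt{p} \cdot 2n^2
\end{align*}
```

The sequential algorithm stores $2n^2 = 2\,097\,152$ elements for A and B, so the parallel total is $\sqrt{p} = 4$ times larger in this instance.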
Matrix-Matrix Multiplication: Cannon's Algorithm
• Cannon's algorithm is a memory-efficient version of the simple parallel algorithm
– with a total memory requirement of $\Theta(n^2)$
• Matrices A and B are partitioned into p square blocks as in the simple parallel algorithm
• Although every process in the ith row requires all $\sqrt{p}$ submatrices $A_{i,k}$, the all-to-all broadcast can be avoided by
– scheduling the computations of the $\sqrt{p}$ processes of the ith row such that, at any given time, each process is using a different block $A_{i,k}$.
– systematically rotating these blocks among the processes after every submatrix multiplication so that every process gets a fresh $A_{i,k}$ after each rotation.
• If an identical schedule is applied to the columns of B, then no process holds more than one block of each matrix at any time
Communication Steps in Cannon's Algorithm
• First, align the blocks of A and B in such a way that each process multiplies its local submatrices:
– shift submatrices $A_{i,j}$ to the left (with wraparound) by i steps
– shift submatrices $B_{i,j}$ up (with wraparound) by j steps.
• After alignment (Figure 8.3c):
– Process $P_{i,j}$ has submatrices $A_{i,(j+i) \bmod \sqrt{p}}$ and $B_{(i+j) \bmod \sqrt{p},\,j}$.
– Perform local block matrix multiplication.
• Next, each block of A moves one step left and each block of B moves one step up (again with wraparound).
• Perform next block multiplication, add to partial result, repeat until all $\sqrt{p}$ blocks have been multiplied.
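The align, multiply and shift steps can be traced with a serial C simulation. This is only a sketch: it uses no MPI, keeps the whole "process grid" in two working copies of A and B, and all function names are hypothetical. It assumes row-major storage and that q (playing the role of √p) divides n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Rotate v[0], v[stride], ..., v[(len-1)*stride] toward index 0 by k
 * positions, with wraparound. tmp must hold len elements. */
static void rotate(double *v, size_t len, size_t stride, size_t k, double *tmp)
{
    for (size_t i = 0; i < len; i++)
        tmp[i] = v[((i + k) % len) * stride];
    for (size_t i = 0; i < len; i++)
        v[i * stride] = tmp[i];
}

/* Shift the blocks of block-row bi left by `steps` block positions. */
static void shift_left(double *m, size_t n, size_t bs, size_t bi,
                       size_t steps, double *tmp)
{
    for (size_t r = bi * bs; r < (bi + 1) * bs; r++)
        rotate(m + r * n, n, 1, steps * bs, tmp);
}

/* Shift the blocks of block-column bj up by `steps` block positions. */
static void shift_up(double *m, size_t n, size_t bs, size_t bj,
                     size_t steps, double *tmp)
{
    for (size_t col = bj * bs; col < (bj + 1) * bs; col++)
        rotate(m + col, n, n, steps * bs, tmp);
}

/* "Process" (bi, bj) multiplies the A and B blocks it currently holds
 * and accumulates the product into its block of C. */
static void local_mult_add(const double *wa, const double *wb, double *c,
                           size_t n, size_t bs, size_t bi, size_t bj)
{
    for (size_t i = 0; i < bs; i++)
        for (size_t j = 0; j < bs; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < bs; k++)
                sum += wa[(bi * bs + i) * n + (bj * bs + k)]
                     * wb[(bi * bs + k) * n + (bj * bs + j)];
            c[(bi * bs + i) * n + (bj * bs + j)] += sum;
        }
}

/* Serial simulation of Cannon's algorithm on a q x q grid of blocks.
 * Returns 0 on success, -1 if memory allocation fails. */
int cannon_simulated(size_t n, size_t q,
                     const double *a, const double *b, double *c)
{
    assert(n > 0 && q > 0 && n % q == 0);
    size_t bs = n / q;
    double *wa = malloc(n * n * sizeof *wa);
    double *wb = malloc(n * n * sizeof *wb);
    double *tmp = malloc(n * sizeof *tmp);
    if (!wa || !wb || !tmp) {
        free(wa); free(wb); free(tmp);
        return -1;
    }
    memcpy(wa, a, n * n * sizeof *wa);
    memcpy(wb, b, n * n * sizeof *wb);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    /* Initial alignment: block-row i of A left by i, block-column j of B up by j. */
    for (size_t i = 0; i < q; i++) {
        shift_left(wa, n, bs, i, i, tmp);
        shift_up(wb, n, bs, i, i, tmp);
    }

    /* q rounds of local multiplication followed by single-step shifts. */
    for (size_t step = 0; step < q; step++) {
        for (size_t bi = 0; bi < q; bi++)
            for (size_t bj = 0; bj < q; bj++)
                local_mult_add(wa, wb, c, n, bs, bi, bj);
        for (size_t i = 0; i < q; i++) {
            shift_left(wa, n, bs, i, 1, tmp);
            shift_up(wb, n, bs, i, 1, tmp);
        }
    }

    free(wa); free(wb); free(tmp);
    return 0;
}
```

In a real MPI program each process would hold only its own two blocks, and the shifts would be messages between neighbouring processes; the simulation keeps everything in one address space purely to make the data movement easy to follow.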
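Cannon's Algorithm: An Example
• Consider the matrices to be multiplied:
$$A = \begin{pmatrix} 2 & 1 & 5 & 3 \\ 0 & 7 & 1 & 6 \\ 9 & 2 & 4 & 4 \\ 3 & 6 & 7 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 6 & 1 & 2 & 3 \\ 4 & 5 & 6 & 5 \\ 1 & 9 & 8 & 8 \\ 4 & 0 & 8 & 5 \end{pmatrix}$$
• Assume that these matrices are partitioned into 4 square blocks as follows, with block (i, j) stored on process $P_{i,j}$:
$$A = \left(\begin{array}{cc|cc} 2 & 1 & 5 & 3 \\ 0 & 7 & 1 & 6 \\ \hline 9 & 2 & 4 & 4 \\ 3 & 6 & 7 & 2 \end{array}\right), \qquad B = \left(\begin{array}{cc|cc} 6 & 1 & 2 & 3 \\ 4 & 5 & 6 & 5 \\ \hline 1 & 9 & 8 & 8 \\ 4 & 0 & 8 & 5 \end{array}\right)$$
• After the initial alignment, matrices A and B become:
$$A = \left(\begin{array}{cc|cc} 2 & 1 & 5 & 3 \\ 0 & 7 & 1 & 6 \\ \hline 4 & 4 & 9 & 2 \\ 7 & 2 & 3 & 6 \end{array}\right), \qquad B = \left(\begin{array}{cc|cc} 6 & 1 & 8 & 8 \\ 4 & 5 & 8 & 5 \\ \hline 1 & 9 & 2 & 3 \\ 4 & 0 & 6 & 5 \end{array}\right)$$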
Cannon's Algorithm: An Example
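• After this alignment, process
– $P_{0,0}$ ends up with $A_{0,0}$ and $B_{0,0}$ and should compute $C_{0,0}$
– $P_{0,1}$ ends up with $A_{0,1}$ and $B_{1,1}$ and should compute $C_{0,1}$
– $P_{1,0}$ ends up with $A_{1,1}$ and $B_{1,0}$ and should compute $C_{1,0}$
– $P_{1,1}$ ends up with $A_{1,0}$ and $B_{0,1}$ and should compute $C_{1,1}$
• The local block matrix multiplications proceed as follows:
$$C_{0,0} = \begin{pmatrix} 2 & 1 \\ 0 & 7 \end{pmatrix} \times \begin{pmatrix} 6 & 1 \\ 4 & 5 \end{pmatrix} = \begin{pmatrix} 16 & 7 \\ 28 & 35 \end{pmatrix}$$
$$C_{0,1} = \begin{pmatrix} 5 & 3 \\ 1 & 6 \end{pmatrix} \times \begin{pmatrix} 8 & 8 \\ 8 & 5 \end{pmatrix} = \begin{pmatrix} 64 & 55 \\ 56 & 38 \end{pmatrix}$$
$$C_{1,0} = \begin{pmatrix} 4 & 4 \\ 7 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & 9 \\ 4 & 0 \end{pmatrix} = \begin{pmatrix} 20 & 36 \\ 15 & 63 \end{pmatrix}$$
$$C_{1,1} = \begin{pmatrix} 9 & 2 \\ 3 & 6 \end{pmatrix} \times \begin{pmatrix} 2 & 3 \\ 6 & 5 \end{pmatrix} = \begin{pmatrix} 30 & 37 \\ 42 & 39 \end{pmatrix}$$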
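Cannon's Algorithm: An Example
• Shift 1: shift each block of A one step to the left and shift each block of B one step up:
$$A = \left(\begin{array}{cc|cc} 5 & 3 & 2 & 1 \\ 1 & 6 & 0 & 7 \\ \hline 9 & 2 & 4 & 4 \\ 3 & 6 & 7 & 2 \end{array}\right), \qquad B = \left(\begin{array}{cc|cc} 1 & 9 & 2 & 3 \\ 4 & 0 & 6 & 5 \\ \hline 6 & 1 & 8 & 8 \\ 4 & 5 & 8 & 5 \end{array}\right)$$
• Next, each process $P_{i,j}$ performs block multiplication, updating $C_{i,j}$:
$$C_{0,0} = C_{0,0} + \begin{pmatrix} 5 & 3 \\ 1 & 6 \end{pmatrix} \times \begin{pmatrix} 1 & 9 \\ 4 & 0 \end{pmatrix} = \begin{pmatrix} 16 & 7 \\ 28 & 35 \end{pmatrix} + \begin{pmatrix} 17 & 45 \\ 25 & 9 \end{pmatrix} = \begin{pmatrix} 33 & 52 \\ 53 & 44 \end{pmatrix}$$
$$C_{0,1} = C_{0,1} + \begin{pmatrix} 2 & 1 \\ 0 & 7 \end{pmatrix} \times \begin{pmatrix} 2 & 3 \\ 6 & 5 \end{pmatrix} = \begin{pmatrix} 64 & 55 \\ 56 & 38 \end{pmatrix} + \begin{pmatrix} 10 & 11 \\ 42 & 35 \end{pmatrix} = \begin{pmatrix} 74 & 66 \\ 98 & 73 \end{pmatrix}$$
$$C_{1,0} = C_{1,0} + \begin{pmatrix} 9 & 2 \\ 3 & 6 \end{pmatrix} \times \begin{pmatrix} 6 & 1 \\ 4 & 5 \end{pmatrix} = \begin{pmatrix} 20 & 36 \\ 15 & 63 \end{pmatrix} + \begin{pmatrix} 62 & 19 \\ 42 & 33 \end{pmatrix} = \begin{pmatrix} 82 & 55 \\ 57 & 96 \end{pmatrix}$$
$$C_{1,1} = C_{1,1} + \begin{pmatrix} 4 & 4 \\ 7 & 2 \end{pmatrix} \times \begin{pmatrix} 8 & 8 \\ 8 & 5 \end{pmatrix} = \begin{pmatrix} 30 & 37 \\ 42 & 39 \end{pmatrix} + \begin{pmatrix} 64 & 52 \\ 72 & 66 \end{pmatrix} = \begin{pmatrix} 94 & 89 \\ 114 & 105 \end{pmatrix}$$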
Cannon's Algorithm: Performance Analysis
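• In the alignment step, the maximum distance over which a block shifts is $\sqrt{p} - 1$,
– the two shift operations require a total of $2\left(t_s + t_w \frac{n^2}{p}\right)$ time (with $t_s$ the message startup time and $t_w$ the per-word transfer time).
• Each of the $\sqrt{p}$ single-step shifts in the compute-and-shift phase of the algorithm takes $t_s + t_w \frac{n^2}{p}$ time.
• The computation time for multiplying $\sqrt{p}$ matrices of size $(n/\sqrt{p}) \times (n/\sqrt{p})$ is $n^3/p$.
• The parallel time is approximately: $T_P = \frac{n^3}{p} + 2\sqrt{p}\, t_s + 2 t_w \frac{n^2}{\sqrt{p}}$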
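As a quick side-by-side, assuming the usual cost model (startup time $t_s$, per-word transfer time $t_w$, base-2 logarithms) and the approximate run times of the simple parallel algorithm and of Cannon's algorithm, the two expressions differ only in the startup term:

```latex
\begin{align*}
T_P^{\text{simple}} &= \frac{n^3}{p} + t_s \log p + 2 t_w \frac{n^2}{\sqrt{p}} \\
T_P^{\text{Cannon}} &= \frac{n^3}{p} + 2\sqrt{p}\, t_s + 2 t_w \frac{n^2}{\sqrt{p}} \\
T_P^{\text{Cannon}} - T_P^{\text{simple}} &= t_s \left( 2\sqrt{p} - \log p \right)
\end{align*}
% Illustrative value: p = 16 gives 2*sqrt(p) = 8 and log2(p) = 4,
% so the difference is 4 t_s.
```

Under these approximations Cannon's algorithm pays somewhat more in message startups, while its computation and per-word communication terms are unchanged; what it gains is the $\Theta(n^2)$ total memory requirement.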
MPI_Cart_shift Function
• Shifting data along the dimensions of the 2-D mesh is a frequent operation in Cannon’s algorithm
– MPI provides the function MPI_Cart_shift for this purpose.
int MPI_Cart_shift(MPI_Comm comm_cart,  /* communicator with Cartesian structure (handle) */
                   int dir,             /* coordinate dimension along which to shift */
                   int s_step,          /* shift size/displacement (> 0: upward shift, < 0: downward shift) */
                   int *rank_source,    /* rank of source process */
                   int *rank_dest)      /* rank of destination process */
• Here is an example program exercising this function.
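To make the meaning of the source and destination ranks concrete, the sketch below mimics in plain C (no MPI calls) the rank arithmetic that MPI_Cart_shift performs on a periodic q x q grid. It assumes row-major rank numbering, which is what MPI uses for Cartesian communicators; the function name cart_shift_periodic is hypothetical.

```c
#include <assert.h>

/* Serial mimic of the ranks reported by MPI_Cart_shift on a periodic
 * q x q grid with row-major ranks (rank = row * q + col).
 * dim 0 is the row coordinate, dim 1 is the column coordinate.
 * The destination lies at coordinate + disp, the source at coordinate - disp. */
void cart_shift_periodic(int q, int rank, int dim, int disp,
                         int *rank_source, int *rank_dest)
{
    assert(q > 0 && rank >= 0 && rank < q * q && (dim == 0 || dim == 1));

    int coords[2] = { rank / q, rank % q };
    int src[2] = { coords[0], coords[1] };
    int dst[2] = { coords[0], coords[1] };

    /* ((x % q) + q) % q keeps the result in 0..q-1 for negative x */
    dst[dim] = (((coords[dim] + disp) % q) + q) % q;
    src[dim] = (((coords[dim] - disp) % q) + q) % q;

    *rank_source = src[0] * q + src[1];
    *rank_dest = dst[0] * q + dst[1];
}
```

For the left shift of A in Cannon's algorithm one would use dim = 1 and disp = -1: the destination is then the left neighbour and the source the right neighbour, with wraparound at the grid edge. The upward shift of B corresponds to dim = 0 and disp = -1.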
Sending and Receiving Messages Simultaneously
• To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
• The arguments combine those of the separate send and receive functions.
• If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)
• A parallel program for Cannon’s algorithm is here.
Overlapping Communication with Computation
• Our MPI programs so far used blocking send/receive operations to perform point-to-point communication.
• As discussed earlier,
– a blocking send operation remains blocked until the message has been copied out of the send buffer
– a blocking receive operation returns only after the message has been received and copied into the receive buffer.
• In Cannon's algorithm, for example, each process blocks on MPI_Sendrecv_replace
– until the specified matrix block has been sent and received by the corresponding processes.
• Note that the blocks of matrices A and B do not change as they are shifted among the processes
– Thus, we can overlap the transmission of these blocks with the computation for the matrix-matrix multiplication
– Many recent distributed-memory parallel computers have dedicated communication controllers that can perform the transmission of messages without interrupting the CPUs.
Non-Blocking Communication Operations
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations.
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request *request)
• These functions return before the corresponding operations have completed.
• Function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished.
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
• MPI_Wait waits for the operation to complete. An example is here.
int MPI_Wait(MPI_Request *request, MPI_Status *status)
Cannon’s Algorithm using Non-Blocking Operations
• Here is the parallel program for Cannon’s algorithm using nonblocking operations
• Two main differences between this program and the earlier one using blocking operations:
1. Additional arrays, a_buffers and b_buffers, are used for the blocks of A and B that are being received while the computation involving the previous blocks is performed.
2. In the main computational loop, it first starts the non-blocking send operations to send the locally stored blocks of A and B to the processes left and up the grid, and then starts the non-blocking receive operations to receive the blocks for the next iteration from the processes right and down the grid.
• After starting these four non-blocking operations, it proceeds to perform the matrix-matrix multiplication of the blocks it currently stores.
• Finally, before it proceeds to the next iteration, it uses MPI_Wait to wait for the send and receive operations to complete.