CSE 160 - Lecture 11: Computation/Communication Analysis - Matrix Multiply

Page 1: CSE 160 - Lecture 11

CSE 160 - Lecture 11

Computation/Communication Analysis - Matrix Multiply

Page 2: CSE 160 - Lecture 11

Granularity

• Machine granularity has been defined as MFLOPS / (MB/sec) = flops/byte
– This tries to indicate the balance between computation and communication

• For parallel computation it is important to understand how much computation could be accomplished while sending/receiving a message

Page 3: CSE 160 - Lecture 11

Message Startup Latency

• Granularity as defined only tells part of the story
– If it told the whole story, then message startup latency would not be important

• Message startup latency - the time it takes to start sending a message of any length

• This latency is approximated by measuring the latency of a zero-byte message (a measurement sketch follows this slide)

• There are other measures that are important, too.
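A minimal sketch of how one might estimate zero-byte message latency with an MPI ping-pong between two ranks. This is not the course's benchmark code; the repetition count and the use of MPI_Wtime are my choices.

/* Hypothetical ping-pong sketch for estimating zero-byte message latency
   between ranks 0 and 1; one-way latency is half the averaged round trip. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int reps = 1000;   /* repetition count is an arbitrary choice */
    char dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("estimated one-way zero-byte latency: %.2f microseconds\n",
               (t1 - t0) / (2.0 * reps) * 1e6);

    MPI_Finalize();
    return 0;
}

Run with at least two ranks; ranks beyond 0 and 1 simply idle through the loop.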

Page 4: CSE 160 - Lecture 11

Back of the envelope calculations

• Suppose we have a 733 MHz Pentium III (assuming roughly one flop per cycle, i.e. ~733 Mflops), Myrinet (100 MB/sec), and a zero-length message latency of 10 microseconds
– Granularity is 733/100 = 7.3

• ~7 flops can be computed for every byte of data sent
– A double-precision float is 8 bytes
– 8*7 = 56 flops for every DP float sent on the network (hmmm)
– In 10 microseconds the CPU can accomplish ~7333 flops

• Every float takes ~0.08 microseconds to transmit
– 100 MB/sec = 100 bytes/microsecond, so 8 bytes take 0.08 microseconds
– 125 floats can be transmitted in one startup latency (10 / 0.08)
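A small C sketch that just reproduces the arithmetic above; the machine parameters are the slide's example numbers, not measured values.

/* Back-of-the-envelope granularity numbers for the slide's example machine. */
#include <stdio.h>

int main(void)
{
    double mflops     = 733.0;   /* ~733 MHz PIII, assuming ~1 flop per cycle */
    double mb_per_sec = 100.0;   /* Myrinet bandwidth                         */
    double latency_us = 10.0;    /* zero-byte message startup latency         */

    double granularity    = mflops / mb_per_sec;   /* flops per byte                    */
    double flops_per_dp   = granularity * 8.0;     /* ~59; the slide rounds 7.33 to 7,
                                                      giving 56                          */
    double flops_per_lat  = mflops * latency_us;   /* 1 Mflop/s = 1 flop/us             */
    double us_per_dp      = 8.0 / mb_per_sec;      /* 1 MB/s = 1 byte/us                */
    double floats_per_lat = latency_us / us_per_dp;

    printf("granularity:         %.1f flops/byte\n", granularity);
    printf("flops per DP float:  %.0f\n", flops_per_dp);
    printf("flops per latency:   %.0f\n", flops_per_lat);
    printf("float transmit time: %.2f us\n", us_per_dp);
    printf("floats per latency:  %.0f\n", floats_per_lat);
    return 0;
}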

Page 5: CSE 160 - Lecture 11

First Interpretation

• For a 50:50 balance (comp/comm)
– Compute 7333 flops per float transmitted
– If done serially (compute -> message -> compute -> …)
• Effective throughput of the CPU is cut in half
– Only computing 1/2 of the time, messaging the other half

• For a 90:10 balance (comp/comm)
– Compute 9*7333 ≈ 66,000 flops per float transmitted
– If done serially, 90% of the time is spent computing, 10% messaging
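Both cases follow one rule, written out below; f is my notation for the fraction of time spent computing, and 7333 is the per-latency flop count from the previous slide.

\[
  \text{flops per float transmitted} \;=\; \frac{f}{1-f}\times 7333,
  \qquad
  f = 0.5 \Rightarrow 7333,
  \qquad
  f = 0.9 \Rightarrow 9 \times 7333 \approx 66{,}000 .
\]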

Page 6: CSE 160 - Lecture 11

One more calculation

• 733 MHz PIII, 100 Mbit Ethernet, 100 microsecond latency
– Granularity is 733/10 = 73.3 (taking 100 Mbit/sec as roughly 10 MB/sec)
– In one latency period the CPU can do ~73,333 flops
– 90:10 requires 9 * 73,333 ≈ 660,000 flops per float transmitted
– If latency isn't the constraint, but transmission time is, then we have to balance the computation time with the communication time
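That last point can be stated as a general balance condition; the symbols below are my notation, not the slides':

\[
  T_{\mathrm{comm}} \;=\; t_s + \frac{b}{B},
  \qquad
  T_{\mathrm{comp}} \;=\; \frac{F}{R},
\]

where \(t_s\) is the startup latency, \(b\) the bytes sent, \(B\) the bandwidth, \(F\) the flops computed, and \(R\) the compute rate. Communication is balanced or hidden when \(T_{\mathrm{comp}} \ge T_{\mathrm{comm}}\); 50:50 is the equality case.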

Page 7: CSE 160 - Lecture 11

Matrix - Matrix Multiply

• Given two square (N x N) matrices, we want to multiply them together
– The total number of flops is O(2N³)

• There are numerous ways to efficiently parallelize matrix multiply; we'll pick a simple method and analyze its communication and computation costs

Page 8: CSE 160 - Lecture 11

Matrix Multiply - Review: 4x4 example

A = B * C

| A11 A12 A13 A14 |     | B11 B12 B13 B14 |     | C11 C12 C13 C14 |
| A21 A22 A23 A24 |  =  | B21 B22 B23 B24 |  *  | C21 C22 C23 C24 |
| A31 A32 A33 A34 |     | B31 B32 B33 B34 |     | C31 C32 C33 C34 |
| A41 A42 A43 A44 |     | B41 B42 B43 B44 |     | C41 C42 C43 C44 |

In general, entry Akm = dot product of the kth row of B with the mth column of C, e.g.

A32 = (3rd row of B) · (2nd column of C) = B31 C12 + B32 C22 + B33 C32 + B34 C42

Page 9: CSE 160 - Lecture 11

How much computation for NxN?

• Akm = Σj Bkj Cjm, j = 1, 2, …, N
– Count multiplies: N
– Count adds: N
– So every element of A takes 2N flops

• There are N*N elements in the result
– Total is (flops/element) * (#elements)
– (2N)*(N*N) = O(2N³)  (see the triple-loop sketch below)
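For reference, a plain C triple loop that realizes this count, using the slides' A = B·C convention; row-major storage is my assumption.

/* Naive N x N matrix multiply, A = B * C, row-major storage.
   The inner loop does N multiplies and N adds per element of A (2N flops),
   so the whole product costs 2*N*N*N flops. */
void matmul(int N, const double *B, const double *C, double *A)
{
    for (int k = 0; k < N; k++)          /* row of A and B    */
        for (int m = 0; m < N; m++) {    /* column of A and C */
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += B[k * N + j] * C[j * N + m];
            A[k * N + m] = sum;
        }
}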

Page 10: CSE 160 - Lecture 11

The matrix elements can be matrices!

• Matrix elements can themselves be matrices
– E.g. B31 C12 would itself be a matrix multiply

• We can think about matrix blocks of size q x q
– Total computation is (#block multiplies) * 2q³ (see the blocked sketch below)

• Let’s formulate two parallel algorithms for MM
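As a sketch, a blocked version of the same product in C, assuming for simplicity that the block size q divides N evenly; each innermost group of loops is one q x q block multiply costing 2q³ flops.

/* Blocked N x N multiply, A = B * C, row-major storage, N a multiple of q. */
void matmul_blocked(int N, int q, const double *B, const double *C, double *A)
{
    for (int i = 0; i < N * N; i++)
        A[i] = 0.0;

    for (int kb = 0; kb < N; kb += q)           /* block row of A and B    */
        for (int mb = 0; mb < N; mb += q)       /* block column of A and C */
            for (int jb = 0; jb < N; jb += q)   /* inner block index       */
                /* one q x q block multiply-accumulate: 2*q^3 flops */
                for (int k = kb; k < kb + q; k++)
                    for (int m = mb; m < mb + q; m++)
                        for (int j = jb; j < jb + q; j++)
                            A[k * N + m] += B[k * N + j] * C[j * N + m];
}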

Page 11: CSE 160 - Lecture 11

First question to ask

• What can be computed independently?
– Does the computation of Akm depend at all on the computation of Apq?

• Another way of asking: does the order in which we compute the elements of A matter?
– Not for MM
– In general, calculations will depend on previous results; this might limit the amount of parallelism

Page 12: CSE 160 - Lecture 11

Simple Parallelism

A11 A12 A13 A14
A21 A22 A23 A24
A31 A32 A33 A34
A41 A42 A43 A44

Divide the computation of A into blocks and assign each block to a different processor (16 in this case).

Each block of A can be computed in parallel.

In theory, this should be 16x faster than on one processor.

Computation of A mapped onto a 4x4 processor grid (a rank-to-block mapping sketch follows below)
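A tiny sketch of one possible mapping from processor rank to its block on the 4x4 grid; 0-based ranks and row-major grid order are my assumptions, not the slides'.

#include <stdio.h>

#define GRID 4   /* 4x4 processor grid */

int main(void)
{
    /* Hypothetical mapping: processor `rank` computes block A[row][col]
       of A's 4x4 block partition. */
    for (int rank = 0; rank < GRID * GRID; rank++) {
        int row = rank / GRID;   /* block row index, 0..3    */
        int col = rank % GRID;   /* block column index, 0..3 */
        printf("processor %2d -> block A%d%d\n", rank, row + 1, col + 1);
    }
    return 0;
}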

Page 13: CSE 160 - Lecture 11

Next Questions to Ask

• How will the three matrices be stored in a parallel program? Two choices:
– Every node gets a complete copy of all the matrices (could run out of memory)
– Distribute the storage across different computers so that each “node” holds only some part of each matrix

• Can compute on much larger matrices!

Page 14: CSE 160 - Lecture 11

Every Processor Gets a Copy of the Matrices

• This is a good paradigm for a shared-memory machine
– Every processor can share a single copy of the matrices

• For distributed-memory machines, this needs p*p times more total memory on a p x p processor grid.

• May not be practical for extremely large matrices: one 1000x1000 DP matrix is 8 MB.

• If we ignore the cost of initially distributing the matrices across the multiple memories, then parallel MM runs p*p times faster than on a single CPU.

Page 15: CSE 160 - Lecture 11

Case Two

• Matrices A, B, and C are distributed so that each processor has only some part of each matrix.

• We call this a parallel data decomposition
– Examples include block, cyclic, and strip decompositions

• Why would you do this?
– If you only need to multiply two matrices, today's machines can locally store fairly large matrices, but codes often contain 10s to 100s of similarly sized matrices
– Local memory starts to become a relatively scarce commodity

Page 16: CSE 160 - Lecture 11

Let’s Pick A Row Decomposition

• Basic idea
– Assume P processors and N = q*P
– Assign q rows of each matrix to each processor

With P = 4, each processor owns one block row of A, B, and C (each block Akm is q x q); a row-ownership sketch follows below:

Proc 0 |  A11 A12 A13 A14       B11 B12 B13 B14       C11 C12 C13 C14
Proc 1 |  A21 A22 A23 A24   =   B21 B22 B23 B24   *   C21 C22 C23 C24
Proc 2 |  A31 A32 A33 A34       B31 B32 B33 B34       C31 C32 C33 C34
Proc 3 |  A41 A42 A43 A44       B41 B42 B43 B44       C41 C42 C43 C44
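A small sketch of which global rows processor z owns under this strip decomposition; 0-based indexing and the example values of P and q are my choices.

/* Row-strip decomposition: with N = q*P, processor z (0-based) owns
   global rows z*q .. (z+1)*q - 1 of A, B, and C. */
#include <stdio.h>

int main(void)
{
    int P = 4, q = 2;          /* example: 4 processors, q rows each */
    int N = q * P;

    for (int z = 0; z < P; z++)
        printf("processor %d owns rows %d..%d of the %dx%d matrices\n",
               z, z * q, (z + 1) * q - 1, N, N);
    return 0;
}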

Page 17: CSE 160 - Lecture 11

What do we need to compute a block

• Consider a block Akm

– Need the kth (block) row of B and the mth (block) column of C
– Where are these located?

• All of the kth row of B is on the same processor that holds the kth row of A
– Why? Our chosen decomposition of the data

• The mth column of C is distributed among all the processors
– Why? Our chosen decomposition

Page 18: CSE 160 - Lecture 11

Let's assume some memory constraints

• Each process has just enough memory to hold three q x N matrix strips (one strip each of A, B, and C)
– Enough to get all the data in place to easily compute a row/column dot product

• Can do this on each processor one q x q block at a time
– How much computation is needed to compute a single q x q block in a row?
• 2q³ flops
– How much data needs to be transferred?
• 3q² floats - the q x q blocks of the column of C held by the 3 other processors

Page 19: CSE 160 - Lecture 11

Basic Algorithm - SPMD

Assume my processor id = z (z = 1, …, p); p is the number of processors; blocks are q x q

for (j = 1; j <= p; j++) {
    for (k = 1; k <= p, k != z; k++)
        send Czj to processor k
    for (k = 1; k <= p, k != z; k++)
        receive Ckj from processor k
    compute Azj locally
    barrier();
}
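Below is a hedged MPI sketch of the same loop in C. It is not the course's reference code: I assume 0-based ranks, row-major q x N strips per rank, and I express the pseudocode's send/receive-to-everyone exchange as an MPI_Allgather, which performs the same all-to-all exchange of block column j of C and makes the explicit barrier unnecessary.

/* Hypothetical sketch: row-decomposed A = B * C with MPI.
   Each rank owns a q x N strip (its block row) of A, B, and C; N = q * P. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void row_matmul(int N, int q, int P,
                const double *myB,  /* my q x N block row of B */
                const double *myC,  /* my q x N block row of C */
                double *myA)        /* my q x N block row of A (output) */
{
    double *Cblk = malloc((size_t)q * q * sizeof *Cblk);     /* my Czj, packed      */
    double *Ccol = malloc((size_t)P * q * q * sizeof *Ccol); /* all of C's column j */

    memset(myA, 0, (size_t)q * N * sizeof *myA);

    for (int j = 0; j < P; j++) {                 /* block column j of A and C */
        /* pack my q x q block Czj (columns j*q .. j*q + q-1 of my strip) */
        for (int r = 0; r < q; r++)
            memcpy(Cblk + r * q, myC + r * N + j * q, q * sizeof *Cblk);

        /* exchange: afterwards Ccol holds C1j, C2j, ..., CPj in rank order */
        MPI_Allgather(Cblk, q * q, MPI_DOUBLE,
                      Ccol, q * q, MPI_DOUBLE, MPI_COMM_WORLD);

        /* compute Azj = sum over k of Bzk * Ckj; each term is 2q^3 flops */
        for (int k = 0; k < P; k++) {
            const double *Ckj = Ccol + (size_t)k * q * q;
            for (int r = 0; r < q; r++)
                for (int c = 0; c < q; c++) {
                    double sum = 0.0;
                    for (int t = 0; t < q; t++)
                        sum += myB[r * N + k * q + t] * Ckj[t * q + c];
                    myA[r * N + j * q + c] += sum;
                }
        }
    }
    free(Cblk);
    free(Ccol);
}

Each rank receives (P-1)*q² floats per iteration, which matches the 3q² count on the previous slide for P = 4.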

Page 20: CSE 160 - Lecture 11

Let’s compute some work

• For each iteration, each process does the following
– Computes 2q³ flops
– Sends/receives 3q² floats

• We’ve added the transmission of data to the basic algorithm.

• For what size q does the time required for data transmission balance the time for computation?

Page 21: CSE 160 - Lecture 11

Some Calculations

• Using just a granularity measure, how do we choose a minimum q (for 50:50)?
– On Myrinet, we need to perform 56 flops for every float transferred
– We perform 2q³ flops per iteration; assume a flop costs 1 unit
– Each float transferred “costs” 56 units
– So when is flop cost >= transfer cost?
• 2q³ >= 3*56*q², i.e. q >= 84; call it roughly q ≈ 80 (worked out below)
– For a q = 80 block, at 733 Mflops this computation takes about 1.4 ms
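The threshold worked out; the 56 flops per float figure comes from the earlier Myrinet granularity slide.

\[
  2q^3 \;\ge\; 3 \cdot 56 \cdot q^2
  \quad\Longleftrightarrow\quad
  q \;\ge\; \frac{3 \cdot 56}{2} \;=\; 84 ,
\]

and at \(733\) Mflops a \(q = 80\) block costs roughly \(2 \cdot 80^3 / (733 \times 10^6) \approx 1.4\) ms of computation.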

Page 22: CSE 160 - Lecture 11

What does 50:50 really mean

• On two processors, the job takes the same amount of time as on one

• On four processors, with scaling, it goes twice as fast
– 50% efficiency

Page 23: CSE 160 - Lecture 11

Final Thoughts

• This data decomposition is not close to optimal
– We could get more parallelism by running on a p x p processor grid and having each processor do just one block multiply

• MM is highly parallelizable, and better algorithms get good efficiencies even on workstations.