jim demmel eecs & math departments, uc berkeley [email protected]
DESCRIPTION
Minimizing Communication in Numerical Linear Algebra www.cs.berkeley.edu/~demmel Case Study: Matrix Multiply. Jim Demmel EECS & Math Departments, UC Berkeley [email protected]. Why Matrix Multiplication?. An important kernel in many problems Appears in many linear algebra algorithms - PowerPoint PPT PresentationTRANSCRIPT
1
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Case Study: Matrix Multiply
Jim DemmelEECS & Math Departments, UC Berkeley
Summer School-Lecture 22
Why Matrix Multiplication?
• An important kernel in many problems
• Appears in many linear algebra algorithms
• Bottleneck for dense linear algebra
• Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
• Optimization ideas can be used in other problems
• The best case for optimization payoffs
• The most-studied algorithm in high performance computing
Summer School-Lecture 23
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
Summer School-Lecture 24
Note on Matrix Storage
• A matrix is a 2-D array of elements, but memory addresses are “1-D”
• Conventions for matrix layout• by column, or “column major” (Fortran default); A(i,j) at A+i+j*n• by row, or “row major” (C default) A(i,j) at A+i*n+j• recursive (later)
• Column major (for now)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
4
8
12
16
1
5
9
13
17
2
6
10
14
18
3
7
11
15
19
Column major Row major
cachelinesBlue row of matrix is stored in red cachelines
Figure source: Larry Carter, UCSD
Column major matrix in memory
Summer School-Lecture 25
Computational Intensity: Key to algorithm efficiency
Machine Balance: Key to machine efficiency
Using a Simple(r) Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and slow• All data initially in slow memory
• m = number of memory elements (words) moved between fast and slow memory
• tm = time per slow memory operation
• f = number of arithmetic operations
• tf = time per arithmetic operation << tm
• q = f / m average number of flops per slow memory access
• Minimum possible time = f* tf when all data in fast memory
• Actual time • f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
• Larger q means time closer to minimum f * tf • q tm/tf needed to get at least half of peak speed
Summer School-Lecture 26
Warm up: Matrix-vector multiplication
{implements y = y + A*x}for i = 1:n
for j = 1:ny(i) = y(i) + A(i,j)*x(j)
= + *
y(i) y(i)
A(i,:)
x(:)
Summer School-Lecture 27
Warm up: Matrix-vector multiplication
{read x(1:n) into fast memory}{read y(1:n) into fast memory}for i = 1:n
{read row i of A into fast memory} for j = 1:n
y(i) = y(i) + A(i,j)*x(j){write y(1:n) back to slow memory}
• m = number of slow memory refs = 3n + n2
• f = number of arithmetic operations = 2n2
• q = f / m 2
• Matrix-vector multiplication limited by slow memory speed
Summer School-Lecture 28
Modeling Matrix-Vector Multiplication
• Compute time for nxn = 1000x1000 matrix• Time
• f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
• = 2*n2 * tf * (1 + tm/tf * 1/2)
• For tf and tm, using data from R. Vuduc’s PhD (pp 351-3)• http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
• For tm use minimum-memory-latency / words-per-cache-line Clock Peak Linesize t_m/t_fMHz Mflop/s Bytes
Ultra 2i 333 667 38 66 16 24.8Ultra 3 900 1800 28 200 32 14.0Pentium 3 500 500 25 60 32 6.3Pentium3M 800 800 40 60 32 10.0Power3 375 1500 35 139 128 8.8Power4 1300 5200 60 10000 128 15.0Itanium1 800 3200 36 85 32 36.0Itanium2 900 3600 11 60 64 5.5
Mem Lat (Min,Max) cycles machine
balance(q must be at leastthis for ½ peak speed)
Summer School-Lecture 29
Simplifying Assumptions
• What simplifying assumptions did we make in this analysis?
• Ignored parallelism in processor between memory and arithmetic within the processor
• Sometimes drop arithmetic term in this type of analysis
• Assumed fast memory was large enough to hold three vectors• Reasonable if we are talking about any level of cache• Not if we are talking about registers (~32 words)
• Assumed the cost of a fast memory access is 0• Reasonable if we are talking about registers• Not necessarily if we are talking about cache (1-2 cycles for L1)
• Memory latency is constant
• Could simplify even further by ignoring memory operations in X and Y vectors
• Mflop rate/element = 2 / (2* tf + tm)
Summer School-Lecture 210
Validating the Model
• How well does the model predict actual performance? • Actual DGEMV: Most highly optimized code for the platform
• Model sufficient to compare across machines• But under-predicting on most recent ones due to latency estimate
0
200
400
600
800
1000
1200
1400
Ultra 2i Ultra 3 Pentium 3 Pentium3M Power3 Power4 Itanium1 Itanium2
MFl
op/s
Predicted MFLOP(ignoring x,y)
Pre DGEMV Mflops(with x,y)
Actual DGEMV(MFLOPS)
Summer School-Lecture 211
Naïve Matrix Multiply
{implements C = C + A*B}for i = 1 to n for j = 1 to n
for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j)
= + *
C(i,j) C(i,j) A(i,:)
B(:,j)
Algorithm has 2*n3 = O(n3) Flops and operates on 3*n2 words of memory
q potentially as large as 2*n3 / 3*n2 = O(n)
Summer School-Lecture 212
Naïve Matrix Multiply
{implements C = C + A*B}for i = 1 to n {read row i of A into fast memory} for j = 1 to n {read C(i,j) into fast memory} {read column j of B into fast memory} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory}
= + *
C(i,j) A(i,:)
B(:,j)C(i,j)
Summer School-Lecture 213
Naïve Matrix Multiply
Number of slow memory references on unblocked matrix multiplym = n3 to read each column of B n times
+ n2 to read each row of A once + 2n2 to read and write each element of C once = n3 + 3n2
So q = f / m = 2n3 / (n3 + 3n2) 2 for large n, no improvement over matrix-vector multiply
Inner two loops are just matrix-vector multiply, of row i of A times BSimilar for any other order of 3 loops
= + *
C(i,j) C(i,j) A(i,:)
B(:,j)
Summer School-Lecture 214
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
Summer School-Lecture 215
Naïve Matrix Multiply on RS/6000
-1
0
1
2
3
4
5
6
0 1 2 3 4 5
log Problem Size
log
cycl
es/fl
opT = N4.7
O(N3) performance would have constant cycles/flopPerformance looks like O(N4.7)
Size 2000 took 5 days
12000 would take1095 years
Slide source: Larry Carter, UCSD
Summer School-Lecture 216
Naïve Matrix Multiply on RS/6000
Slide source: Larry Carter, UCSD
0
1
2
3
4
5
6
0 1 2 3 4 5
log Problem Size
log
cycl
es/fl
op
Page miss every iteration
TLB miss every iteration
Cache miss every 16 iterations Page miss every 512 iterations
Summer School-Lecture 217
Blocked (Tiled) Matrix Multiply
Consider A,B,C to be N-by-N matrices of b-by-b subblocks where b=n / N is called the block size for i = 1 to N
for j = 1 to N {read block C(i,j) into fast memory} for k = 1 to N {read block A(i,k) into fast memory} {read block B(k,j) into fast memory} C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks} {write block C(i,j) back to slow memory}
= + *
C(i,j) C(i,j) A(i,k)
B(k,j)
Summer School-Lecture 218
Blocked (Tiled) Matrix Multiply
Recall: m is amount memory traffic between slow and fast memory matrix has nxn elements, and NxN blocks each of size bxb f is number of floating point operations, 2n3 for this problem q = f / m is our measure of algorithm efficiency in the memory systemSo:
m = N*n2 read each block of B N3 times (N3 * b2 = N3 * (n/N)2 = N*n2) + N*n2 read each block of A N3 times + 2n2 read and write each block of C once = (2N + 2) * n2
So computational intensity q = f / m = 2n3 / ((2N + 2) * n2) n / N = b for large nSo we can improve performance by increasing the blocksize b Can be much faster than matrix-vector multiply (q=2)
Summer School-Lecture 219
Analyzing Machine Speed Limits
The blocked algorithm has computational intensity q b• The larger the block size, the more efficient our algorithm will be• Limit: All three blocks from A,B,C must fit in fast memory (cache), so
we cannot make these blocks arbitrarily large
• Assume your fast memory has size Mfast
3b2 Mfast, so q b (Mfast/3)1/2
requiredt_m/t_f KB
Ultra 2i 24.8 14.8Ultra 3 14 4.7Pentium 3 6.25 0.9Pentium3M 10 2.4Power3 8.75 1.8Power4 15 5.4Itanium1 36 31.1Itanium2 5.5 0.7
• To build a machine to run matrix multiply at 1/2 peak arithmetic speed of the machine, we need a fast memory of size
Mfast 3b2 3q2 = 3(tm/tf)2
• This size is reasonable for L1 cache, but not for register sets
• Note: analysis assumes it is possible to schedule the instructions perfectly
Summer School-Lecture 220
Limits to Optimizing Matrix Multiply
• The blocked algorithm changes the order in which values are accumulated into each C[i,j] by applying commutativity and associativity
• Get slightly different answers from naïve code, because of roundoff - OK
• The previous analysis showed that the blocked algorithm has computational intensity:
q b (Mfast/3)1/2
• There is a lower bound result that says we cannot do any better than this (using only associativity)
• Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O( (Mfast)
1/2 )• Does not apply to algorithms like Strassen
Summer School-Lecture 2
Lower bound for all “direct” linear algebra
• Holds for sequential algorithms for• Matmul, BLAS, LU, QR, eig, SVD, …• Some whole programs (sequences of these operations, no
matter how they are interleaved, eg computing Ak)• Dense and sparse matrices (where #flops << n3 )• Some graph-theoretic algorithms (eg Floyd-Warshall)• Proof in Lecture 3
21
Let M = “fast” memory size
#words_moved = (#flops / M1/2 )
#messages_sent = (#flops / M3/2 )
Summer School-Lecture 222
What if there are more than 2 levels of memory?
• Recall goal is to minimize communication between all levels• The tiled algorithm requires finding a good block size
• Machine dependent• Need to “block” b x b matrix multiply in inner most loop
• 1 level of memory 3 nested loops (naïve algorithm)• 2 levels of memory 6 nested loops• 3 levels of memory 9 nested loops …
• Cache Oblivious Algorithms offer an alternative• Treat nxn matrix multiply as a set of smaller problems• Eventually, these will fit in cache• Will minimize # words moved between every level of memory
hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory etc.) – at least asymptotically
Summer School-Lecture 2
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2m
• C = = A · B = · ·
=
• True when each Aij etc 1x1 or n/2 x n/2
23
A11 A12
A21 A22
B11 B12
B21 B22
C11 C12
C21 C22
A11·B11 + A12·B21 A11·B12 + A12·B22
A21·B11 + A22·B21 A21·B12 + A22·B22
func C = RMM (A, B, n) if n = 1, C = A * B, else { C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2) C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2) C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2) C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) } return
Summer School-Lecture 2
Recursive Matrix Multiplication (2/2)
24
func C = RMM (A, B, n) if n=1, C = A * B, else { C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2) C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2) C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2) C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) } return
A(n) = # arithmetic operations in RMM( . , . , n)
= 8 · A(n/2) + 4(n/2)2 if n > 1, else 1
= 2n3 … same operations as usual, in different order
M(n) = # words moved between fast, slow memory by RMM( . , . , n)
= 8 · M(n/2) + 4(n/2)2 if 3n2 > Mfast , else 3n2
= O( n3 / (Mfast )1/2 + n2 ) … same as blocked matmul
Summer School-Lecture 225
Recursion: Cache Oblivious Algorithms
• Recursion for general A (mxn) * B (nxp)
• Case1: m>= max{n,p}: split A horizontally:• Case 2 : n>= max{m,p}: split A vertically and B horizontally • Case 3: p>= max{m,n}: split B vertically
• Attains lower bound in O() sense
BA
BAB
A
A
2
1
2
1
2121 ,, BABABBA
Case 1
Case 3
BABAB
BAA 21
2
121,
Case 2
1 2
Summer School-Lecture 2
Experience with Cache-Oblivious Algorithms
• In practice, need to cut off recursion well before 1x1 blocks• Call “Micro-kernel” for small blocks, eg 16 x 16• Implementing a high-performance Cache-Oblivious code is not easy
• Using fully recursive approach with highly optimized recursive micro-kernel, Pingali et al report that they never got more than 2/3 of peak.
• Issues with Cache Oblivious (recursive) approach• Recursive Micro-Kernels yield less performance than iterative ones
using same scheduling techniques• Pre-fetching is needed to compete with best code: not well-understood
in the context of Cache Oblivous codes
Unpublished work, presented at LACSI 2006
Summer School-Lecture 2
Minimizing latency requires new data structures• To minimize latency, need to load/store whole rectangular
subblock of matrix with one “message”• Incompatible with conventional columnwise (rowwise) storage
• Ex: Rows (columns) not in contiguous memory locations
• Blocked storage: store as matrix of bxb blocks, each block stored contiguously
• Ok for one level of memory hierarchy, what if more?
• Recursive blocked storage: store each block using subblocks• Also known as “space filling curves”, “Morton ordering”
27
Summer School-Lecture 2
28
Comparing matrix-vector to matrix-matrix mult
BLAS3 (MatrixMatrix) vs. BLAS2 (MatrixVector)
0100200300400500600700800900
1000
AMD A
thlon
-600
DEC ev5
6-53
3
DEC ev6
-500
HP9000
/735
/135
IBM
PPC60
4-11
2
IBM
Pow
er2-
160
IBM
Pow
er3-
200
Pentiu
m P
ro-2
00
Pentiu
m II
-266
Pentiu
m II
I-550
SGI R10
000ip
28-2
00
SGI R12
000ip
30-2
70
MF
lop
/s
DGEMM
DGEMV
Data source: Jack Dongarra
• Matrix-matrix mult ≡ DGEMM, matrix-vector mult ≡ DGEMV
Summer School-Lecture 2
How hard is hand-tuning matmul, anyway?
29
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09• Students given “blocked” code to start with
• Still hard to get close to vendor tuned performance (ACML)• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Summer School-Lecture 2
How hard is hand-tuning matmul, anyway?
30
Summer School-Lecture 2
31
What part of the Matmul Search Space Looks Like
A 2-D slice of a 3-D register-tile search space. The dark blue region was pruned.(Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)
Num
ber o
f col
umns
in r e
g is t
er b
l ock
Number of rows in register block
Finding needlein haystack!
Summer School-Lecture 2
Automatic Performance Tuning
• Goal: Let machine do hard work of writing fast code
• What do tuning of Matmul, dense BLAS, FFTs, signal processing, have in common?
• Can do the tuning off-line: once per architecture, algorithm• Can take as much time as necessary (hours, a week…)• At run-time, algorithm choice may depend only on few parameters
(matrix dimensions, size of FFT, etc.)• Examples: PHiPAC, ATLAS, FFTW, Spiral
• Can’t always do tuning off-line• Algorithm and implementation may strongly depend on data only known
at run-time• Ex: Sparse matrix nonzero pattern determines both best data structure
and implementation of Sparse-matrix-vector-multiplication (SpMV) • Part of search for best algorithm just be done (very quickly!) at run-time• Example: OSKI
32
Summer School-Lecture 2
33
Autotuning Matmul with ATLAS (n = 500)
• ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
• ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.
0.0
100.0
200.0
300.0
400.0
500.0
600.0
700.0
800.0
900.0
Architectures
MF
LO
PS
Vendor BLASATLAS BLASF77 BLAS
Source: Jack Dongarra
Summer School-Lecture 2
Parallel matrix-matrix multiplication
• Consider distributed memory machines• Each processor has its own private memory• Communication by sending messages over a network• Examples: MPI, UPC, Titanium
• First question: how is matrix initially distributed across different processors?
34
Summer School-Lecture 2
Different Parallel Data Layouts for Matrices (not all!)
0123012301230123
0 1 2 3 0 1 2 3
1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout
3) 1D Column Block Cyclic Layout
4) Row versions of the previous layouts
Generalizes others
0 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 3 6) 2D Row and Column
Block Cyclic Layout
0 1 2 3
0 1
2 3
5) 2D Row and Column Blocked Layout
b
35
Summer School-Lecture 2 36
Parallel Matrix-Vector Product• Compute y = y + A*x, where A is a dense matrix• Layout:
• 1D row blocked
• A(i) refers to the n by n/p block row that processor i owns
• x(i) and y(i) similarly refer to segments of x,y owned by i
• Algorithm:• Foreach processor i• Broadcast x(i)• Compute y(i) = A(i)*x
• Algorithm uses the formulay(i) = y(i) + A(i)*x = y(i) + j A(i,j)*x(j)
x
y
P0
P1
P2
P3
P0 P1 P2 P3
A(0)
A(1)
A(2)
A(3)
Summer School-Lecture 237
Matrix-Vector Product y = y + A*x
• A column layout of the matrix eliminates the broadcast of x• But adds a reduction to update the destination y
• A 2D blocked layout uses a broadcast and reduction, both on a subset of processors
• sqrt(p) for square processor grid
P0 P1 P2 P3
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
Summer School-Lecture 238
Parallel Matrix Multiply
• Computing C=C+A*B• Using basic algorithm: 2*n3 Flops• Depends on:
• Data layout• Topology of machine • Scheduling communication
• Use of simple performance model for algorithm design• Message Time = “latency” + #words * time-per-word
= + #words *• Efficiency:
• serial time / (p * parallel time)• perfect (linear) speedup efficiency = 1
Summer School-Lecture 239
Matrix Multiply with 1D Column Layout
• Assume matrices are n x n and n is divisible by p
• A(i) refers to the n by n/p block column that processor i owns (similiarly for B(i) and C(i))
• B(i,j) is the n/p by n/p sublock of B(i) • in rows j*n/p through (j+1)*n/p
• Algorithm uses the formulaC(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)
p0 p1 p2 p3 p5 p4 p6 p7
May be a reasonable assumption for analysis, not for code
Summer School-Lecture 240
Matrix Multiply: 1D Layout on Bus or Ring
• Algorithm uses the formulaC(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)
• First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet)
• Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
Summer School-Lecture 241
MatMul: 1D layout on Bus without Broadcast
Naïve algorithm: C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
for i = 0 to p-1 for j = 0 to p-1 except i if (myproc == i) send A(i) to processor j if (myproc == j) receive A(i) from processor i C(myproc) = C(myproc) + A(i)*B(i,myproc) barrier
Cost of inner loop: computation: 2*n*(n/p)2 = 2*n3/p2 communication: + *n2 /p
Summer School-Lecture 2 42
Naïve MatMul (continued)
Cost of inner loop: computation: 2*n*(n/p)2 = 2*n3/p2 communication: + *n2 /p … approximately
Only 1 pair of processors (i and j) are active on any iteration, and of those, only i is doing computation => the algorithm is almost entirely serial
Running time: = (p*(p-1) + 1)*computation + p*(p-1)*communication 2*n3 + p2* + p*n2*
This is worse than the serial time and grows with p.When might you still want to do this?
Summer School-Lecture 2 43
Matmul for 1D layout on a Processor Ring
• Pairs of adjacent processors can communicate simultaneously
Copy A(myproc) into Tmp
C(myproc) = C(myproc) + Tmp*B(myproc , myproc)
for j = 1 to p-1
Send Tmp to processor myproc+1 mod p
Receive Tmp from processor myproc-1 mod p
C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc)
• Need to be careful about talking to neighboring processors
• May want double buffering in practice for overlap
• Ignoring deadlock details in code• Time of inner loop = 2*( + *n2/p) + 2*n*(n/p)2
Summer School-Lecture 244
Matmul for 1D layout on a Processor Ring
• Time of inner loop = 2*( + *n2/p) + 2*n*(n/p)2
• Total Time = 2*n* (n/p)2 + (p-1) * Time of inner loop• 2*n3/p + 2*p* + 2**n2
• (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:• Perfect speedup for arithmetic• A(myproc) must move to each other processor, costs at least (p-1)*cost of sending n*(n/p) words
• Parallel Efficiency = 2*n3 / (p * Total Time) = 1/(1 + * p2/(2*n3) + * p/(2*n) ) = 1/ (1 + O(p/n))• Grows to 1 as n/p increases (or and shrink)
Summer School-Lecture 245
MatMul with 2D Layout
• Consider processors in 2D grid (physical or logical)• Processors communicate with 4 nearest neighbors
• Assume p processors form square s x s grid, s = p1/2
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
= *
Summer School-Lecture 246
Cannon’s Algorithm
… C(i,j) = C(i,j) + A(i,k)*B(k,j)… assume s = sqrt(p) is an integer forall i=0 to s-1 … “skew” A left-circular-shift row i of A by i … so that A(i,j) overwritten by A(i,(j+i)mod s) forall i=0 to s-1 … “skew” B up-circular-shift column i of B by i … so that B(i,j) overwritten by B((i+j)mod s), j) for k=0 to s-1 … sequential forall i=0 to s-1 and j=0 to s-1 … all processors in parallel C(i,j) = C(i,j) + A(i,j)*B(i,j) left-circular-shift each row of A by 1 up-circular-shift each column of B by 1
k
Summer School-Lecture 247
C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)
Cannon’s Matrix Multiplication
Summer School-Lecture 2 48
Initial Step to Skew Matrices in Cannon
• Initial blocked input
• After skewing before initial block multiplies
A(1,0)
A(2,0)
A(0,1) A(0,2)
A(1,1)
A(2,1)
A(1,2)
A(2,2)
A(0,0)
B(0,1) B(0,2)
B(1,0)
B(2,0)
B(1,1) B(1,2)
B(2,1) B(2,2)
B(0,0)
A(1,0)
A(2,0)
A(0,1) A(0,2)
A(1,1)
A(2,1)
A(1,2)
A(2,2)
A(0,0)
B(0,1)
B(0,2)B(1,0)
B(2,0)
B(1,1)
B(1,2)
B(2,1)
B(2,2)B(0,0)
Summer School-Lecture 249
Skewing Steps in CannonAll blocks of A must multiply all like-colored blocks of B
• First step
• Second
• Third
A(1,0)
A(2,0)
A(0,1) A(0,2)
A(1,1)
A(2,1)
A(1,2)
A(2,2)
A(0,0)
B(0,1)
B(0,2)B(1,0)
B(2,0)
B(1,1)
B(1,2)
B(2,1)
B(2,2)B(0,0)
A(1,0)
A(2,0)
A(0,1) A(0,2)
A(2,1)
A(1,2) B(0,1)
B(0,2)B(1,0)
B(2,0)
B(1,1)
B(1,2)
B(2,1)
B(2,2)B(0,0)
A(1,0)
A(2,0)
A(0,1)A(0,2)
A(1,1)
A(2,1)
A(1,2)
A(2,2)
A(0,0) B(0,1)
B(0,2)B(1,0)
B(2,0)
B(1,1)
B(1,2)
B(2,1)
B(2,2)B(0,0)
A(1,1)
A(2,2)
A(0,0)
Summer School-Lecture 2 50
Cost of Cannon’s Algorithm forall i=0 to s-1 … recall s = sqrt(p) left-circular-shift row i of A by i … cost ≤ s*( + *n2/p) forall i=0 to s-1 up-circular-shift column i of B by i … cost ≤ s*( + *n2/p) for k=0 to s-1 forall i=0 to s-1 and j=0 to s-1
C(i,j) = C(i,j) + A(i,j)*B(i,j) … cost = 2*(n/s)3 = 2*n3/p3/2
left-circular-shift each row of A by 1 … cost = + *n2/p up-circular-shift each column of B by 1 … cost = + *n2/p
° Total Time = 2*n3/p + 4* s* + 4**n2/s ° Parallel Efficiency = 2*n3 / (p * Total Time)
= 1/( 1 + * 2*(s/n)3 + * 2*(s/n) ) = 1/(1 + O(sqrt(p)/n)) ° Grows to 1 as n/s = n/sqrt(p) = sqrt(data per processor) grows° Better than 1D layout, which had Efficiency = 1/(1 + O(p/n))
Summer School-Lecture 2
Lower bound for all “direct” linear algebra
• Holds for parallel and sequential algorithms for • Matmul (attained by Cannon), BLAS, LU, QR, eig, SVD, …• Some whole programs (sequences of these operations, no
matter how they are interleaved, eg computing Ak)• Dense and sparse matrices (where #flops << n3 )• Some graph-theoretic algorithms (eg Floyd-Warshall)• Proof in Lecture 3
51
Let M = “fast” memory size = O(n2/p) per processor
#words_moved (by at least one processor) = (#flops / M1/2 ) = ((n3/p) / (n2/p)1/2 ) = (n2/p1/2 )
#messages_sent (by at least one processor) = (#flops / M3/2 ) = (p1/2 )
Summer School-Lecture 2 52
Pros and Cons of Cannon• So what if it’s “optimal”, is it fast?
• Yes: Local computation one call to (optimized) matrix-multiply
• Hard to generalize for• p not a perfect square• A and B not square• Dimensions of A, B not perfectly divisible by s=sqrt(p)• A and B not “aligned” in the way they are stored on processors• block-cyclic layouts
• Memory hog (extra copies of local matrices)
Summer School-Lecture 2 53
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply • Slightly less efficient than Cannon, but simpler and easier to
generalize• Can accommodate any layout, dimensions, alignment• Uses broadcast of submatrices instead of circular shifts• Sends log p times as much data as Cannon• Can use much less extra memory than Cannon, but send more
messages
• Similar ideas appeared many times
• Used in practice in PBLAS = Parallel BLAS• www.netlib.org/lapack/lawns/lawn{96,100}.ps
Summer School-Lecture 2 54
SUMMA
* =i
j
A(i,k)
k
k
B(k,j)
• i, j represent all rows, columns owned by a processor• k is a block of b 1 rows or columns
• C(i,j) = C(i,j) + k A(i,k)*B(k,j)
• Assume a pr by pc processor grid (pr = pc = 4 above) • Need not be square
C(i,j)
Summer School-Lecture 255
SUMMA
For k=0 to n-1 … or n/b-1 where b is the block size
… = # cols in A(i,k) and # rows in B(k,j)
for all i = 1 to pr … in parallel
owner of A(i,k) broadcasts it to whole processor row
for all j = 1 to pc … in parallel
owner of B(k,j) broadcasts it to whole processor column
Receive A(i,k) into Acol
Receive B(k,j) into Brow
C_myproc = C_myproc + Acol * Brow
* =i
j
A(i,k)
k
k
B(k,j)
C(i,j)
Summer School-Lecture 256
SUMMA performance
For k=0 to n/b-1
for all i = 1 to s … s = sqrt(p)
owner of A(i,k) broadcasts it to whole processor row
… time = log s *( + * b*n/s), using a tree
for all j = 1 to s
owner of B(k,j) broadcasts it to whole processor column
… time = log s *( + * b*n/s), using a tree
Receive A(i,k) into Acol
Receive B(k,j) into Brow
C_myproc = C_myproc + Acol * Brow
… time = 2*(n/s)2*b
° Total time = 2*n3/p + * log p * n/b + * log p * n2 /s
° To simplify analysis only, assume s = sqrt(p)
Summer School-Lecture 2 57
SUMMA performance
• Total time = 2*n3/p + * log p * n/b + * log p * n2 /s
• Parallel Efficiency =
1/(1 + * log p * p / (2*b*n2) + * log p * s/(2*n) )
• Same term as Cannon, except for log p factor
log p grows slowly so this is ok
• Latency () term can be larger, depending on b
When b=1, get * log p * n
As b grows to n/s, term shrinks to
* log p * s (log p times Cannon)
• Temporary storage grows like 2*b*n/s
• Can change b to tradeoff latency cost with memory
Summer School-Lecture 22/25/2009 58
PDGEMM = PBLAS routine for matrix multiply
Observations: For fixed N, as P increases Mflops increases, but less than 100% efficiency For fixed P, as N increases, Mflops (efficiency) rises
DGEMM = BLAS routine for matrix multiply
Maximum speed for PDGEMM = # Procs * speed of DGEMM
Observations (same as above): Efficiency always at least 48% For fixed N, as P increases, efficiency drops For fixed P, as N increases, efficiency increases
Summer School-Lecture 259
Summary of Parallel Matrix Multiplication• 1D Layout
• Bus without broadcast - slower than serial• Nearest neighbor communication on a ring (or bus with broadcast):
Efficiency = 1/(1 + O(p/n))
• 2D Layout• Cannon
• Efficiency = 1/(1+O(sqrt(p) /n+* sqrt(p) /n)) – optimal!• Hard to generalize for general p, n, block cyclic, alignment
• SUMMA• Efficiency = 1/(1 + O(log p * p / (b*n2) + log p * sqrt(p) /n))• Very General• b small => less memory, lower efficiency• b large => more memory, high efficiency• Used in practice (PBLAS)
Summer School-Lecture 2
Why so much about matrix multiplication?
60
Summer School-Lecture 2
A brief history of (Dense) Linear Algebra software (1/5)
• Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)• Standard library of 15 operations (mostly) on vectors
• “AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc• Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
• Goals• Common “pattern” to ease programming, readability, self-
documentation• Robustness, via careful coding (avoiding over/underflow)• Portability + Efficiency via machine specific implementations
• Why BLAS 1 ? They do O(n1) ops on O(n1) data• Used in libraries like LINPACK (for linear systems)
• Source of the name “LINPACK Benchmark” (not the code!)
CS267 Lecture 10
• In the beginning was the do-loop…
Summer School-Lecture 2
A brief history of (Dense) Linear Algebra software (2/5)
• But the BLAS-1 weren’t enough• Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes • “Computational intensity” = #flops / #mem_refs = (2n)/(3n) = 2/3• Too low to run near peak speed (time for mem_refs dominates)• Hard to vectorize (“SIMD’ize”) on supercomputers of the day
(1980s)
• So the BLAS-2 were invented (1984-1986)• Standard library of 25 operations (mostly) on matrix/vector pairs
• “GEMV”: y = α·A·x + β·x, “GER”: A = A + α·x·yT, “TRSV”: y = T-1·x
• Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
• Why BLAS 2 ? They do O(n2) ops on O(n2) data
• So computational intensity still just ~(2n2)/(n2) = 2• OK for vector machines, but not for machine with caches
CS267 Lecture 10
Summer School-Lecture 2
A brief history of (Dense) Linear Algebra software (3/5)
• The next step: BLAS-3 (1987-1988)• Standard library of 9 operations (mostly) on matrix/matrix pairs
• “GEMM”: C = α·A·B + β·C, “SYRK”: C = α·A·AT + β·C, “TRSM”: C = T-1·B
• Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
• Why BLAS 3 ? They do O(n3) ops on O(n2) data
• So computational intensity (2n3)/(4n2) = n/2 – big at last!• Tuning opportunities machines with caches, other mem. hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)• Source: 142 routines, 31K LOC, Testing: 28K LOC• Reference (unoptimized) implementation only
• Ex: 3 nested loops for GEMM
• Lots more optimized code• Most computer vendors provide own optimized versions• Motivates “automatic tuning” of the BLAS
CS267 Lecture 10
Summer School-Lecture 2
A brief history of (Dense) Linear Algebra software (4/5)• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
• Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1
• How do we reorganize GE to use BLAS-3 ? (details later)
• Contents of LAPACK (summary)• Algorithms we can turn into (nearly) 100% BLAS 3
– Linear Systems: solve Ax=b for x
– Least Squares: choose x to minimize ||r||2 ri2 where r=Ax-b
• Algorithms we can only make up to ~50% BLAS 3 (so far)– “Eigenproblems”: Find and x where Ax = x
– Singular Value Decomposition (SVD): ATAx=2x • Error bounds for everything• Lots of variants depending on A’s structure (banded, A=AT, etc)
• How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)• Source: 1582 routines, 490K LOC, Testing: 352K LOC
• Ongoing development (at UCB, UTK and elsewhere)
CS267 Lecture 10
Summer School-Lecture 2
What could go into a linear algebra library?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all programming interfaces
Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc)
For all architectures and networks
Need to prioritize, automate!
Summer School-Lecture 2
A brief history of (Dense) Linear Algebra software (5/5)
• Is LAPACK parallel?• Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – “Scalable LAPACK” (1995 – now)• For distributed memory – uses MPI• More complex data structures, algorithms than LAPACK
• Only (small) subset of LAPACK’s functionality available• Details later (projects!)
• All at www.netlib.org/scalapack
CS267 Lecture 10
Summer School-Lecture 267
Success Stories for Sca/LAPACK
Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
• Widely used• Adopted by Mathworks, Cray,
Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, …
• >115M web hits(in 2010, 56M in 2006) @ Netlib (incl. CLAPACK, LAPACK95)
• New Science discovered through the solution of dense matrix systems
• Nature article on the flat universe used ScaLAPACK
• Other articles in Physics Review B that also use it
• 1998 Gordon Bell Prize• www.nersc.gov/news/reports/
newNERSCresults050703.pdf
Summer School-Lecture 2
So is there anything left to do?
• Yes!• Performance of LAPACK and ScaLAPACK much lower
than peak on new architectures• Multicore (eg PLASMA)• GPUs (eg MAGMA)• Heterogeneous clusters of these (MAGMA)• Cloud, Grid, …
• Major reason: almost no algorithms in Sca/LAPACK minimize communication
• Not enough to call BLAS3 in the inner loop of classical algorithms
• And there is still lots of missing functionality…
68
Summer School-Lecture 2
EXTRA SLIDES
69