Lecture 15: Dense Linear Algebra II (Cornell University, CS 5220, bindel/class/cs5220-f11/slides/lec15.pdf)
TRANSCRIPT
Lecture 15: Dense Linear Algebra II
David Bindel
19 Oct 2011
Logistics
- Matrix multiply is graded
- HW 2 logistics
  - Viewer issue was a compiler bug: change the Makefile to switch gcc version or lower optimization, or grab the revised tarball.
  - It is fine to use binning vs. quadtrees
Review: Parallel matmul
- Basic operation: C = C + A*B
- Computation: 2n^3 flops
- Goal: 2n^3/p flops per processor, minimal communication
- Two main contenders: SUMMA and Cannon
Outer product algorithm
Serial: Recall the outer product organization:

    for k = 0:s-1
      C += A(:,k)*B(k,:);
    end
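The outer-product loop above is easy to sanity-check in a few lines of numpy; each iteration adds one rank-1 update, and after s steps the sum equals the full product. (A minimal sketch; the sizes and seed here are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
s = 6
A = rng.standard_normal((s, s))
B = rng.standard_normal((s, s))

# Accumulate C as a sum of s rank-1 (outer product) updates.
C = np.zeros((s, s))
for k in range(s):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```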
Parallel: Assume p = s^2 processors, with the matrices partitioned into s × s blocks. For a 2 × 2 example:

    [C00 C01]   [A00*B00 A00*B01]   [A01*B10 A01*B11]
    [C10 C11] = [A10*B00 A10*B01] + [A11*B10 A11*B11]
- A processor for each (i, j) means parallel work for each k!
- Note everyone in row i uses A(i, k) at once, and everyone in column j uses B(k, j) at once.
Parallel outer product (SUMMA)
    for k = 0:s-1
      for each i in parallel
        broadcast A(i,k) to row
      for each j in parallel
        broadcast B(k,j) to col
      On processor (i,j), C(i,j) += A(i,k)*B(k,j);
    end
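The loop above can be simulated serially: the "broadcasts" become block reads, and each grid position (i, j) does one block multiply per step. A minimal sketch, assuming s divides n:

```python
import numpy as np

def summa_blocked(A, B, s):
    """Serial simulation of SUMMA on an s-by-s processor grid.

    Step k 'broadcasts' block column k of A along rows and block
    row k of B along columns; processor (i, j) then does one
    block multiply-accumulate.
    """
    n = A.shape[0]
    b = n // s  # block size (assume s divides n)
    C = np.zeros_like(A)
    for k in range(s):
        for i in range(s):
            Aik = A[i*b:(i+1)*b, k*b:(k+1)*b]      # broadcast along row i
            for j in range(s):
                Bkj = B[k*b:(k+1)*b, j*b:(j+1)*b]  # broadcast down column j
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Aik @ Bkj
    return C

rng = np.random.default_rng(1)
n, s = 12, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
assert np.allclose(summa_blocked(A, B, s), A @ B)
```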
If we have a tree along each row/column, then
- log(s) messages per broadcast
- α + βn^2/s^2 per message
- 2 log(s)(αs + βn^2/s) total communication
- Compare to 1D ring: (p − 1)α + (1 − 1/p)n^2 β
Note: Same ideas work with block size b < n/s
SUMMA
Parallel outer product (SUMMA)
If we have a tree along each row/column, then
- log(s) messages per broadcast
- α + βn^2/s^2 per message
- 2 log(s)(αs + βn^2/s) total communication
Assuming communication and computation can potentially overlap completely, what does the speedup curve look like?
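One way to answer: plug the cost model above into a simple estimate. A sketch, under the assumption that with perfect overlap the parallel time is the slower of the compute and communication terms; γ (time per flop), α, and β are hypothetical machine parameters, not values from the slides.

```python
import math

def summa_speedup(n, s, alpha, beta, gamma):
    """Estimated SUMMA speedup on p = s^2 processors under the
    slide's cost model, assuming perfect compute/communication
    overlap (so the slower of the two terms dominates)."""
    p = s * s
    t_serial = 2.0 * n**3 * gamma                 # 2n^3 flops, serial
    t_flops = t_serial / p                        # ideal parallel compute
    t_comm = 2.0 * math.log2(s) * (alpha * s + beta * n**2 / s)
    t_par = max(t_flops, t_comm)                  # full overlap
    return t_serial / t_par

# For fixed p, flops dominate as n grows, so speedup approaches p;
# for small n, communication dominates and speedup collapses.
assert abs(summa_speedup(4096, 8, 1e-6, 1e-9, 1e-10) - 64.0) < 1e-6
assert summa_speedup(64, 8, 1e-6, 1e-9, 1e-10) < 8
```

The qualitative shape is the usual one: a flat communication-bound regime for small n, then near-ideal speedup once each processor's 2n^3/p flops swamp the 2 log(s)(αs + βn^2/s) communication.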
Reminder: Why matrix multiply?
LAPACK structure:

    LAPACK
    BLAS

Build fast serial linear algebra (LAPACK) on top of BLAS 3.
Reminder: Why matrix multiply?
ScaLAPACK structure:

    ScaLAPACK
    PBLAS     LAPACK
    BLACS     BLAS
    MPI

ScaLAPACK builds additional layers on the same idea.
Reminder: Evolution of LU
On board...
Blocked GEPP

1. Find pivot
2. Swap pivot row
3. Update within block
4. Delayed update (at end of block)
Big idea
- Delayed update strategy lets us do LU fast
  - Could also have delayed application of the pivots
- Same idea works with other one-sided factorizations (QR)
- Can get decent multi-core speedup with parallel BLAS!
  ... assuming n is sufficiently large.

There are still some issues left over (block size? pivoting?)...
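The delayed-update idea can be sketched in a few dozen lines of numpy: factor one block column with partial pivoting (BLAS 2 work), then apply the accumulated updates to the trailing matrix as a triangular solve plus one big matmul (BLAS 3). A minimal sketch, not a production GEPP; the block size b and test matrix are arbitrary.

```python
import numpy as np

def blocked_lu(A, b):
    """Blocked LU with partial pivoting (GEPP) via delayed updates.

    Returns (LU, piv): L (unit lower) and U packed in one array,
    plus the row permutation, so L @ U == A[piv]."""
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # Factor the current block column with partial pivoting.
        for k in range(k0, k1):
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p], :] = A[[p, k], :]
                piv[[k, p]] = piv[[p, k]]
            A[k+1:, k] /= A[k, k]
            # Update only within the block column; rest is delayed.
            A[k+1:, k+1:k1] -= np.outer(A[k+1:, k], A[k, k+1:k1])
        if k1 < n:
            # Delayed update: triangular solve + one big matmul (BLAS 3).
            L11 = np.tril(A[k0:k1, k0:k1], -1) + np.eye(k1 - k0)
            A[k0:k1, k1:] = np.linalg.solve(L11, A[k0:k1, k1:])
            A[k1:, k1:] -= A[k1:, k0:k1] @ A[k0:k1, k1:]
    return A, piv

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
LU, piv = blocked_lu(M, 3)
L = np.tril(LU, -1) + np.eye(8)
U = np.triu(LU)
assert np.allclose(L @ U, M[piv])
```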
Explicit parallelization of GE
What to do:
- Decompose into work chunks
- Assign work to threads in a balanced way
- Orchestrate the communication and synchronization
- Map which processors execute which threads
Possible matrix layouts
1D column blocked: bad load balance
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
Possible matrix layouts
1D column cyclic: hard to use BLAS2/3
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
    0 1 2 0 1 2 0 1 2
Possible matrix layouts
1D column block cyclic: block column factorization is a bottleneck

    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
    0 0 1 1 2 2 0 0 1 1
Possible matrix layouts
Block skewed: indexing gets messy
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    0 0 0 1 1 1 2 2 2
    2 2 2 0 0 0 1 1 1
    2 2 2 0 0 0 1 1 1
    2 2 2 0 0 0 1 1 1
    1 1 1 2 2 2 0 0 0
    1 1 1 2 2 2 0 0 0
    1 1 1 2 2 2 0 0 0
Possible matrix layouts
2D block cyclic:
    0 0 1 1 0 0 1 1
    0 0 1 1 0 0 1 1
    2 2 3 3 2 2 3 3
    2 2 3 3 2 2 3 3
    0 0 1 1 0 0 1 1
    0 0 1 1 0 0 1 1
    2 2 3 3 2 2 3 3
    2 2 3 3 2 2 3 3
Possible matrix layouts
- 1D column blocked: bad load balance
- 1D column cyclic: hard to use BLAS 2/3
- 1D column block cyclic: factoring a block column is a bottleneck
- Block skewed (a la Cannon): just complicated
- 2D row/column block: bad load balance
- 2D row/column block cyclic: win!
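The winning 2D block-cyclic map is just modular arithmetic on block indices. A small sketch (the function name `owner` is ours, not ScaLAPACK's) that reproduces the 8 × 8, block-size-2, 2 × 2-grid picture above:

```python
def owner(i, j, b, pr, pc):
    """Grid coordinates of the processor owning entry (i, j) under a
    2D block-cyclic layout with b-by-b blocks on a pr-by-pc grid."""
    return (i // b) % pr, (j // b) % pc

# Ranks 0..3 laid out row-major on the 2x2 grid, as in the slide.
grid = [[owner(i, j, 2, 2, 2)[0] * 2 + owner(i, j, 2, 2, 2)[1]
         for j in range(8)] for i in range(8)]
assert grid[0][:4] == [0, 0, 1, 1]
assert grid[2][:4] == [2, 2, 3, 3]
assert grid[4] == grid[0]  # layout repeats cyclically every pr*b rows
```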
Distributed GEPP

1. Find pivot (column broadcast)
2. Swap pivot row within block column + broadcast pivot
3. Update within block column
4. At end of block, broadcast swap info along rows
5. Apply all row swaps to other columns
6. Broadcast block L_II right
7. Update remainder of block row
8. Broadcast rest of block row down
9. Broadcast rest of block col right
10. Update of trailing submatrix
Cost of ScaLAPACK GEPP
Communication costs:
- Lower bound: O(n^2/√P) words, O(√P) messages
- ScaLAPACK:
  - O(n^2 log P/√P) words sent
  - O(n log P) messages
  - Problem: the reduction to find the pivot in each column
- Recent research on stable variants without partial pivoting
What if you don't care about dense Gaussian elimination? Let's review some ideas in a different setting...
Floyd-Warshall
Goal: Find shortest path lengths between all node pairs.

Idea: Dynamic programming! Define

    d_ij^(k) = length of shortest path from i to j with intermediates in {1, ..., k}.

Then

    d_ij^(k) = min( d_ij^(k-1), d_ik^(k-1) + d_kj^(k-1) )

and d_ij^(n) is the desired shortest path length.
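The recurrence translates directly into three nested loops. A minimal sketch with a tiny hand-checkable graph:

```python
INF = float('inf')

def floyd_warshall(D):
    """All-pairs shortest paths; D[i][j] holds the edge weight
    (inf if no edge, 0 on the diagonal). Updates D in place
    following the recurrence d^(k) = min(d^(k-1), d_ik + d_kj)."""
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Small example: the path 0 -> 1 -> 2 (cost 3) beats edge 0 -> 2 (cost 10).
D = [[0, 1, 10],
     [INF, 0, 2],
     [INF, INF, 0]]
floyd_warshall(D)
assert D[0][2] == 3
```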
The same and different
Floyd’s algorithm for all-pairs shortest paths:
    for k=1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k)+D(k,j));
Unpivoted Gaussian elimination (overwriting A):
    for k=1:n
      for i = k+1:n
        A(i,k) = A(i,k) / A(k,k);
        for j = k+1:n
          A(i,j) = A(i,j) - A(i,k)*A(k,j);
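The elimination loop can be checked by reassembling A = LU from the overwritten array. A sketch in numpy; the test matrix is made diagonally dominant so that skipping pivots is safe for this example:

```python
import numpy as np

def lu_unpivoted(A):
    """Unpivoted Gaussian elimination, overwriting a copy of A with
    L below the diagonal (unit diagonal implicit) and U on and
    above it -- the same loop as the slide, vectorized over i."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k+1:, k] /= A[k, k]                          # multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # trailing update
    return A

# Diagonally dominant => safe without pivoting.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6)) + 6 * np.eye(6)
LU = lu_unpivoted(M)
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
assert np.allclose(L @ U, M)
```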
The same and different
- The same: O(n^3) time, O(n^2) space
- The same: can't move the k loop (data dependencies)
  - ... at least, not without care!
- Different from matrix multiplication
- The same: x_ij^(k) = f( x_ij^(k-1), g( x_ik^(k-1), x_kj^(k-1) ) )
  - Same basic dependency pattern in the updates!
  - Similar algebraic relations satisfied
- Different: update to the full matrix vs. the trailing submatrix
How far can we get?
How would we
- Write a cache-efficient (blocked) serial implementation?
- Write a message-passing parallel implementation?
The full picture could make a fun class project...
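For the first question, one known answer mirrors blocked LU: process the diagonal block for the current range of intermediates k, then the blocks in its row and column, then everything else. A sketch (assuming the block size b divides n, nonnegative weights, and a zero diagonal), checked against the unblocked algorithm:

```python
import numpy as np

def fw_block(D, A, B):
    # Floyd-Warshall-style update in place: D = min(D, A[:,k] + B[k,:]).
    for k in range(A.shape[1]):
        np.minimum(D, A[:, k:k+1] + B[k:k+1, :], out=D)

def blocked_floyd_warshall(D, b):
    """Cache-blocked all-pairs shortest paths (sketch; b divides n)."""
    n = D.shape[0]
    s = n // b
    for K in range(s):
        sl = slice(K*b, (K+1)*b)
        fw_block(D[sl, sl], D[sl, sl], D[sl, sl])        # diagonal block
        for I in range(s):                               # its row and column
            if I != K:
                r = slice(I*b, (I+1)*b)
                fw_block(D[r, sl], D[r, sl], D[sl, sl])
                fw_block(D[sl, r], D[sl, sl], D[sl, r])
        for I in range(s):                               # remaining blocks
            for J in range(s):
                if I != K and J != K:
                    r, c = slice(I*b, (I+1)*b), slice(J*b, (J+1)*b)
                    fw_block(D[r, c], D[r, sl], D[sl, c])
    return D

rng = np.random.default_rng(4)
n = 8
W = rng.uniform(1, 10, (n, n))
np.fill_diagonal(W, 0)

# Unblocked reference for comparison.
ref = W.copy()
for k in range(n):
    ref = np.minimum(ref, ref[:, k:k+1] + ref[k:k+1, :])

assert np.allclose(blocked_floyd_warshall(W.copy(), 4), ref)
```

The same block schedule suggests the message-passing version: distribute blocks 2D block-cyclically, broadcast the diagonal block, then the row and column panels, much like the trailing update in distributed GEPP.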
Onward!
Next up: Sparse linear algebra and iterative solvers!