How to Compute and Prove
Lower and Upper Bounds on the
Communication Costs of Your Algorithm
Part III: Graph analysis
Oded Schwartz
CS294, Lecture #10, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294
Based on:
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:
Graph expansion and communication costs of fast matrix multiplication.
Previous talk on lower bounds

Communication Lower Bounds. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.
Previous talk on lower bounds: algorithms with a "flavor" of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]

• BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.
• Dense or sparse matrices. In sparse cases: bandwidth is a function of NNZ.
• Bandwidth and latency.
• Sequential, hierarchical, and parallel (distributed- and shared-memory) models.
• Compositions of linear algebra operations.
• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11]
• Tensor contractions

Recall the bounds:
Sequential: BW = Ω(n³ / M^(1/2))
Parallel:   BW = Ω(n³ / (P · M^(1/2)))
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

(1) Generalized form: for (i,j) ∈ S,
C(i,j) = f_ij( g_(i,j,k1)(A(i,k1), B(k1,j)), g_(i,j,k2)(A(i,k2), B(k2,j)), …, other arguments ),
where k1, k2, … ∈ S_ij.
But many algorithms just don’t fit the generalized form!
For example: Strassen’s fast matrix multiplication
Beyond 3-nested loops
How about the communication costs of algorithms that have a more complex structure?
Communication Lower Bounds

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.
Recall: Strassen's Fast Matrix Multiplication [Strassen 69]

• Compute 2 × 2 matrix multiplication using only 7 multiplications (instead of 8).
• Apply recursively (block-wise), on n/2 × n/2 blocks:

C11 C12   A11 A12   B11 B12
C21 C22 = A21 A22 · B21 B22

M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

T(n) = 7 T(n/2) + O(n²)  ⇒  T(n) = Θ(n^(log2 7))
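The recursion above can be sketched directly in code. The following is a minimal, illustrative Python implementation on plain nested lists (n a power of 2); it mirrors the seven products and the four output sums, with no attention to communication costs or constant factors:

```python
# Strassen's recursion on plain Python lists; n must be a power of 2.
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(X):  # split into four h x h blocks
        return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
                [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # The seven recursive products M1..M7:
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # The four output blocks:
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Counting the recursive calls confirms the recurrence: multiplying n × n matrices triggers 7^(lg n) = n^(lg 7) scalar multiplications.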
Strassen-like algorithms

• Compute n0 × n0 matrix multiplication using only n0^(ω0) multiplications (instead of n0³).
• Apply recursively (block-wise), on blocks of size n/n0.

T(n) = n0^(ω0) · T(n/n0) + O(n²)  ⇒  T(n) = Θ(n^(ω0))

Exponents ω0 achieved over time:
2.81 [Strassen 69] (works fast in practice)
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan, Romani; Coppersmith, Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith, Winograd 90]
2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)
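The exponent comes straight out of the recurrence: ω0 = log_(n0)(number of multiplications), by the master theorem. A quick sanity check:

```python
import math

# omega0 = log_{n0}(t) for an algorithm that multiplies n0 x n0 blocks
# using t scalar multiplications, from T(n) = t * T(n/n0) + O(n^2).
def omega0(n0, t):
    return math.log(t, n0)

print(round(omega0(2, 7), 3))  # Strassen: log2 7 ≈ 2.807
print(round(omega0(2, 8), 3))  # classic 2x2: log2 8 = 3.0
```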
New lower bound for Strassen's fast matrix multiplication

[Ballard, Demmel, Holtz, S. 2011b]: The communication bandwidth lower bound is:

Recall for cubic:  BW = Ω((n/√M)^(log2 8) · M)  sequential;  Ω((n/√M)^(log2 8) · M/P)  parallel
For Strassen's:    BW = Ω((n/√M)^(log2 7) · M)  sequential;  Ω((n/√M)^(log2 7) · M/P)  parallel
Strassen-like:     BW = Ω((n/√M)^(ω0) · M)      sequential;  Ω((n/√M)^(ω0) · M/P)      parallel

The parallel lower bound applies to
2D: M = Θ(n²/P)
2.5D: M = Θ(c·n²/P)
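To see how the three exponents compare numerically, here is a small sketch; the sizes n and M below are arbitrary illustrative values, not from the lecture:

```python
import math

# Shapes of the sequential bounds, BW = Omega((n/sqrt(M))^w * M),
# for the three exponents on the slide.
def bw_lower_bound(n, M, w):
    return (n / math.sqrt(M))**w * M

n, M = 2**20, 2**16  # illustrative problem size and fast-memory size
classic   = bw_lower_bound(n, M, 3)             # cubic: w = log2 8 = 3
strassens = bw_lower_bound(n, M, math.log2(7))  # Strassen: w = log2 7
assert strassens < classic  # Strassen's lower bound is asymptotically smaller
```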
For sequential? hierarchy? Yes, existing implementations do!
For parallel 2D? parallel 2.5D? Yes: new algorithms.
Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and Hierarchy cases: Attained by the natural recursive implementation.
Also: LU, QR,… (Black-box use of fast matrix multiplication)
[Ballard, Demmel, Holtz, S., Rom 2011]: New 2D parallel Strassen-like algorithm.
Attains the lower bound.
New 2.5D parallel Strassen-like algorithm: a c^(ω0/2 - 1) parallel communication speedup over the 2D implementation (where c · 3n² = M · P).
[Ballard, Demmel, Holtz, S. 2011b]:This is as good as it gets.
Implications for sequential architectural scaling
• Requirements so that "most" time is spent doing arithmetic on n × n dense matrices, n² > M:
• Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
• Time to multiply the 2 largest locally storable square matrices exceeds the latency.
Strassen-like algorithms do fewer flops and less communication, but are more demanding on the hardware. If ω0 = 2, it is all about communication.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic                            | M^(1/2)                       | M^(3/2)
Strassen-like                      | M^(ω0/2 - 1)                  | M^(ω0/2)
Expansion (3rd approach) [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]

Let G = (V, E) be a d-regular graph, and let A be its normalized adjacency matrix, with eigenvalues 1 = λ1 ≥ λ2 ≥ … ≥ λn. Let λ = max{λ2, |λn|}; the spectral gap is 1 - λ.

The edge expansion of G is
h(G) = min over S ⊆ V, |S| ≤ |V|/2 of |E(S, V\S)| / (d·|S|)

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]:
(1 - λ)/2 ≤ h(G) ≤ √(2(1 - λ))
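The inequality can be sanity-checked on a small d-regular graph. The sketch below uses the 5-cycle, whose normalized adjacency eigenvalues have the closed form cos(2πk/n); this is an illustration of the definitions, not part of the lecture's proof:

```python
import math
from itertools import combinations

# Check (1 - lambda)/2 <= h(G) <= sqrt(2(1 - lambda)) on the 5-cycle (d = 2).
n, d = 5, 2
edges = [(i, (i + 1) % n) for i in range(n)]

def edge_expansion(edges, n, d):
    # h(G): min over S with |S| <= n/2 of |E(S, V\S)| / (d|S|), by brute force
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            S = set(S)
            cut = sum(1 for u, v in edges if (u in S) != (v in S))
            best = min(best, cut / (d * len(S)))
    return best

# Normalized adjacency eigenvalues of the n-cycle: cos(2*pi*k/n).
eigs = sorted((math.cos(2 * math.pi * k / n) for k in range(n)), reverse=True)
lam = max(eigs[1], abs(eigs[-1]))
h = edge_expansion(edges, n, d)
assert (1 - lam) / 2 <= h <= math.sqrt(2 * (1 - lam))
```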
Expansion (3rd approach): The Computation Directed Acyclic Graph

Communication cost is graph expansion.

[Figure: a computation DAG; vertices are inputs/outputs and intermediate values, edges are dependencies. A segment S has a read set R_S and a write set W_S crossing between S and V\S.]
Expansion (3rd approach)

For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG G = (V, E): V = set of computations and inputs, E = dependencies.
2. Partition G into segments S of Θ(M^(ω0/2)) vertices (corresponding to adjacency in time / location).
3. Show that every S has ≥ 3M vertices with incoming / outgoing edges ⇒ each segment performs ≥ M reads/writes.
4. The total communication bandwidth is
BW = (BW of one segment) × (#segments) = Ω(M) · O(n^(ω0)) / Θ(M^(ω0/2)) = Ω(n^(ω0) / M^(ω0/2 - 1)).

[Figure: an execution timeline of Reads, FLOPs, and Writes, partitioned into segments S1, S2, S3, …]
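The arithmetic in step 4 can be checked directly for Strassen (ω0 = lg 7); the values of n and M below are arbitrary examples:

```python
import math

# With segments of Theta(M^{w/2}) vertices, each forcing Omega(M) transfers:
# BW ~ M * T(n) / M^{w/2} = (n / sqrt(M))^w * M, where w = lg 7 for Strassen.
w = math.log2(7)

def bw_from_segments(n, M):
    segments = n**w / M**(w / 2)  # total vertices / segment size
    return M * segments          # transfers per segment x #segments

n, M = 2**12, 2**16
assert math.isclose(bw_from_segments(n, M), (n / math.sqrt(M))**w * M)
```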
Is it a Good Expander?

Break G into edge-disjoint subgraphs, each corresponding to the algorithm on M^(1/2) × M^(1/2) matrices. Consider the expansion of S in each part (they sum up).

We need to show that a segment of Θ(M^(ω0/2)) vertices expands to Ω(M):
h(G(n)) = Ω(M / M^(ω0/2)) for n = Θ(M^(1/2)).

Namely, for every n, h(G(n)) = Ω(n² / n^(lg 7)) = Ω((4/7)^(lg n)).

BW = Ω(T(n)) · h(G(M^(1/2)))

[Figure: Enc_(lg n) A and Enc_(lg n) B (n² inputs each) feed n^(lg 7) multiplications, then Dec_(lg n) C (lg n layers); segments S1, …, S5 marked on the DAG.]
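The identity used above, n² / n^(lg 7) = (4/7)^(lg n), can be verified numerically:

```python
import math

# n^2 / n^{lg 7} = n^{2 - lg 7} = (2^{2 - lg 7})^{lg n} = (4/7)^{lg n}
for lg_n in range(1, 11):
    n = 2 ** lg_n
    assert math.isclose(n**2 / n**math.log2(7), (4 / 7)**lg_n)
```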
What is the CDAG of Strassen’s algorithm?
The DAG of Strassen, n = 2

M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11) (B11 + B12)
M7 = (A12 - A22) (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

[Figure: Enc1A and Enc1B map the entries A1,1, …, A2,2 and B1,1, …, B2,2 to the seven products M1, …, M7; Dec1C maps the products to C1,1, …, C2,2.]
The DAG of Strassen, n = 4

One recursive level:
• Each vertex splits into four.
• Multiply blocks.

[Figure: Enc1A and Enc1B feed the seven block multiplications, each itself a copy of the n = 2 DAG; a cross-layer of Dec1C combines their outputs into C.]
The DAG of Strassen: further recursive steps

[Figure: Enc_(lg n) A and Enc_(lg n) B (n² vertices each) feed n^(lg 7) multiplications; Dec_(lg n) C has lg n layers.]

Recursive construction: given Dec_i C, construct Dec_(i+1) C:
1. Duplicate it 4 times.
2. Connect with a cross-layer of Dec1C.
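A minimal way to see the recursive growth of this construction in code: count the multiplication vertices, which multiply by 7 at each level (this toy count is mine, not from the slides):

```python
# Multiplication vertices in Strassen's CDAG: each level of the recursion
# replaces one multiplication by 7 recursive ones.
def mult_vertices(n):
    if n == 1:
        return 1
    return 7 * mult_vertices(n // 2)

assert mult_vertices(8) == 7**3  # n^{lg 7} = 7^{lg n}
```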
The DAG of Strassen

1. Compute weighted sums of A's elements.
2. Compute weighted sums of B's elements.
3. Compute the multiplications m1, m2, …, m_(n^(lg 7)).
4. Compute weighted sums of m1, m2, …, m_(n^(lg 7)) to obtain C.

[Figure: A and B feed Enc_(lg n) A and Enc_(lg n) B at the top; Dec_(lg n) C produces C at the bottom.]
Expansion of a Segment
Two methods to compute the expansion of the recursively constructed graph:
• Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
• Spectral: compute the edge expansion via the spectral gap (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00]).
Expansion of a Segment
Main technical challenges:
• Two types of vertices: with/without recursion.
• The graph is not regular.
[Figure: the n = 2 DAG again: Enc1A, Enc1B, the seven products M1, …, M7, and Dec1C.]
Estimating the edge expansion - Combinatorially

• Dec1C is a consistency gadget: a "Mixed" vertex pays 1/12 of its edges.
• The fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).

[Figure: segments S1, S2, S3, …, Sk across the recursion levels; vertices are marked "In S", "Not in S", or "Mixed".]
Communication Lower Bounds

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.
Open Problems

Find algorithms that attain the lower bounds:
• Sparse matrix algorithms
• For sequential and parallel models
• That auto-tune or are cache oblivious

Address complex heterogeneous hardware:
• Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]

Extend the techniques to other algorithms and algorithmic tools:
• Non-uniform recursive structure

Characterize a communication lower bound for a problem rather than for an algorithm.
How to Compute and Prove
Lower Bounds on the
Communication Costs of Your Algorithm
Part III: Graph analysis
Oded Schwartz
CS294, Lecture #10, Fall 2011: Communication-Avoiding Algorithms
Based on:
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:
Graph expansion and communication costs of fast matrix multiplication.
Thank you!