Compiling to Avoid Communication
Kathy Yelick
Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
NERSC Systems and Software Have to Support All of Science
NERSC computing for science:
• 4,500 users, 500 projects
• ~65% from universities, 30% from labs
• 1,500 publications per year!
Systems designed for science:
• 1.3-petaflop Cray system, Hopper
• 8 PB filesystem; 250 PB archive
• Several systems for genomics, astronomy, visualization, etc.
~650 applications use these programming models:
• 75% Fortran, 45% C/C++, 10% Python
• 85% MPI, 25% with OpenMP
• 10% PGAS or global objects
These numbers are self-reported and likely low.
[Chart: flop/s on a log scale from 1.E+08 to 1.E+18, 1990–2020]
Computational Science has Moved through Difficult Technology Transitions
Application Performance Growth (Gordon Bell Prizes)
Attack of the “killer micros”
Attack of the “killer cellphones”?
The rest of the computing world gets parallelism
Essentially, all models are wrong, but some are useful.
-- George E. Box, Statistician
Computing Performance Improvements will be Harder than Ever
[Charts: transistors (thousands), frequency (MHz), power (W), and cores, 1970–2010, built up in three panels]
Moore’s Law continues, but power limits performance growth. Parallelism is used instead.
Energy Efficient Computing is Key to Performance Growth
[Chart: system power, usual scaling vs. goal, 2005–2020]
At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 used 3 MW
• 1 exaflop in 2018 would use 100+ MW with “Moore’s Law” scaling
The problem doesn’t change if we build 1,000 one-petaflop machines instead of one exaflop machine, and it affects every university department cluster and cloud data center.
Communication is Expensive in Time
Communication cost has two components:
• Bandwidth: # words moved / bandwidth
• Latency: # messages × latency
“Communication” includes the memory hierarchy as well as the network. Alas, things are bad and getting worse:
flop_time << 1/bandwidth << latency
Annual improvements [FOSC]:
           flop_time   bandwidth   latency
Overall       59%
Network                   26%        15%
DRAM                      23%         5%
Hard to change: latency is physics; bandwidth is money!
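The two-term cost model can be made concrete, and it already explains why aggregating messages (Lesson #6 later in this talk) pays off. A sketch with made-up machine constants:

```python
def comm_time(n_messages, n_words, latency, word_time):
    """Two-term model: per-message latency plus per-word transfer time."""
    return n_messages * latency + n_words * word_time

# Hypothetical machine: latency dominates the per-word cost by ~1000x.
LATENCY = 1e-6     # seconds per message
WORD_TIME = 1e-9   # seconds per word (1 / bandwidth)

# Sending 1000 words as 1000 one-word messages vs. one 1000-word message:
many_small = comm_time(1000, 1000, LATENCY, WORD_TIME)
one_large = comm_time(1, 1000, LATENCY, WORD_TIME)
assert one_large < many_small  # aggregation amortizes the latency term
```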
Communication is Expensive in Energy
[Chart: energy per operation (picojoules, log scale from 1 to 10000), now vs. 2018, for on-chip/CMP communication, intranode/SMP communication, and intranode/MPI communication]
Communication (any off-chip data access, including to DRAM) is expensive.
The Memory Wall Swamp
Multicore didn’t cause this, but it kept the bandwidth gap growing.
Obama for Communication-Avoiding Algorithms
“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”
FY 2012 Congressional Budget Request, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
Lesson #1: Compress Data Structures
This is a hard one for compilers....
Compression of Meta Data is Important
Select register block size r and c:
(1) Estimate Fill(r,c):
    Fill(r,c) = (#true nonzeros + #explicit zeros filled in) / #true nonzeros
(2) Select the RB(r,c) with the highest Effective_Mflops(r,c):
    Effective_Mflops(r,c) = Mflops(r,c) / Fill(r,c)
Extra flops, but it may still finish sooner.
This is a hard problem for compilers on sparse matrices!
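The selection heuristic fits in a few lines. The fill counts and Mflop/s rates below are hypothetical profile data, not measurements from the talk:

```python
def fill_ratio(true_nonzeros, filled_zeros):
    # Fill(r,c): total stored entries relative to true nonzeros (>= 1.0)
    return (true_nonzeros + filled_zeros) / true_nonzeros

def effective_mflops(dense_mflops, fill):
    # Flops spent on explicitly stored zeros are wasted, so discount the rate
    return dense_mflops / fill

# Hypothetical profile: (r, c) -> (dense-kernel Mflop/s, explicit zeros filled in)
true_nnz = 1_000_000
candidates = {
    (1, 1): (250, 0),        # no blocking: no fill, but a slow kernel
    (2, 2): (400, 300_000),  # moderate fill, faster kernel
    (4, 4): (550, 900_000),  # fastest kernel, but too much fill
}
best = max(candidates,
           key=lambda rc: effective_mflops(candidates[rc][0],
                                           fill_ratio(true_nnz, candidates[rc][1])))
# With these numbers the 2x2 block wins despite doing extra flops.
```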
Offline Analysis to Estimate Mflop/sec
Estimate the Mflop/s rate using a dense matrix stored in sparse format:
Dense Mflops(r,c) / Fill(r,c) for the target matrix (here Tsopf) = Effective_Mflops(r,c)
Selected RB(5×7) with a sample dense matrix.
[Heatmaps over r (number of rows) × c (number of columns)]
Auto-tuned SpMV Performance
[Charts: cumulative optimizations — Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Results from Williams, Shalf, Oliker, Yelick
Williams’ Roofline Performance Model
[Roofline chart: attainable Gflop/s (0.5–256, log scale) vs. actual flop:byte ratio (1/8–16), with ceilings for peak DP, mul/add imbalance, w/out SIMD, and w/out ILP]
Generic machine:
• The flat top is the flop limit
• The slanted part is the bandwidth limit
• The x-axis is the arithmetic intensity of the algorithm
• Bandwidth-reducing optimizations, such as better cache re-use (tiling) and compression, improve intensity
Model by Sam Williams
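The model is just a min of two ceilings. A sketch with made-up machine numbers (not any of the platforms in this talk):

```python
def attainable_gflops(peak_gflops, peak_bw_gb_s, flops_per_byte):
    """Roofline: performance is capped by compute or by memory bandwidth,
    whichever bound is hit first at this arithmetic intensity."""
    return min(peak_gflops, peak_bw_gb_s * flops_per_byte)

# Hypothetical machine: 80 Gflop/s peak, 20 GB/s memory bandwidth.
# The ridge point sits at 80 / 20 = 4 flops per byte.
low_intensity = attainable_gflops(80, 20, 0.25)   # bandwidth-bound (slanted part)
high_intensity = attainable_gflops(80, 20, 16.0)  # compute-bound (flat top)
```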
[Roofline charts for Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi): attainable Gflop/s (2–1024) vs. algorithmic intensity (flops/word, 1/32–32), with single- and double-precision peaks and DP add-only ceilings; kernels plotted: SpMV, 7pt stencil, 27pt stencil, RTM/wave eqn., GTC/chargei, GTC/pushi, DGEMM]
Autotuning Gets Kernel Performance Near Optimal
• Roofline model captures bandwidth and computation limits • Autotuning gets kernels near the roof
Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier, …
Lesson #2: Target Higher-Level Loops
Harder than inner loops....
Avoiding Communication in Iterative Solvers
• Consider sparse iterative methods for Ax=b
  – Krylov subspace methods: GMRES, CG, …
  – Can we lower the communication costs?
    • Latency: reduce # messages by computing multiple reductions at once
    • Bandwidth to the memory hierarchy: compute Ax, A²x, …, Aᵏx with one read of A
• Solve time is dominated by:
  – Sparse matrix-vector multiply (SpMV), which even on one processor is dominated by “communication” time to read the matrix
  – Global collectives (reductions), which are latency-limited
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin
[Diagram: x indexed 1 … 32, with A·x, A²·x, A³·x above]
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Idea: pick up the part of A and x that fits in fast memory; compute each of the k products
• Example: A tridiagonal, n = 32, k = 3
• Works for any “well-partitioned” A
The Matrix Powers Kernel, sequential algorithm:
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Saves bandwidth (one read of A and x for k steps)
• Saves latency (fewer independent read events)
[Diagram: blocks computed in Step 1 … Step 4]
20
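The sequential blocked computation can be sketched for the tridiagonal example. A toy Python version (the block size, stencil weights, and input vector are arbitrary choices, not from the slides): each block of x is extended by k ghost entries per side, so all k products are formed with one pass over the blocks, at the cost of redundant halo computation near block boundaries.

```python
def tridiag_spmv(x, a, b, c):
    """y[i] = a*x[i-1] + b*x[i] + c*x[i+1], with zeros outside the boundary."""
    n = len(x)
    return [(a * x[i-1] if i > 0 else 0.0) + b * x[i] +
            (c * x[i+1] if i < n - 1 else 0.0) for i in range(n)]

def powers_naive(x, k, a, b, c):
    """k separate SpMVs: reads the operand k times."""
    out, v = [], x
    for _ in range(k):
        v = tridiag_spmv(v, a, b, c)
        out.append(v)
    return out

def powers_blocked(x, k, a, b, c, block):
    """One pass over blocks of x, each extended by k ghost entries per side.
    Entries contaminated by the cut boundary never reach the owned range."""
    n = len(x)
    out = [[0.0] * n for _ in range(k)]
    for lo in range(0, n, block):
        hi = min(n, lo + block)
        elo, ehi = max(0, lo - k), min(n, hi + k)  # ghost-extended block
        v = x[elo:ehi]
        for s in range(k):
            v = tridiag_spmv(v, a, b, c)           # redundant work in the halo
            out[s][lo:hi] = v[lo - elo:hi - elo]   # keep only owned entries
    return out

x = [float(i % 7) for i in range(32)]
assert powers_blocked(x, 3, 0.5, 1.0, 0.25, 8) == powers_naive(x, 3, 0.5, 1.0, 0.25)
```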
![Page 21: Compiling to Avoid Communicationyelick/talks/exascale/Yelick-PACT12.pdf• Serial: O(1) moves of data moves vs. O(k) • Parallel: O(log p) messages vs. O(k log p) 23 Joint work with](https://reader033.vdocument.in/reader033/viewer/2022060819/6098918445350a211b46a3b0/html5/thumbnails/21.jpg)
The Matrix Powers Kernel, parallel algorithm:
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Each processor communicates once with its neighbors
[Diagram: regions owned by Proc 1 … Proc 4]
The Matrix Powers Kernel, parallel algorithm (continued):
• Each processor works on an (overlapping) trapezoid
• Saves latency (# of messages), not bandwidth, and adds redundant computation
[Diagram: overlapping trapezoids for Proc 1 … Proc 4]
Matrix Powers Kernel on a General Matrix
• Saves communication for “well-partitioned” matrices
• Serial: O(1) moves of data vs. O(k)
• Parallel: O(log p) messages vs. O(k log p)
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin
For implicit memory management (caches), the layout is chosen with a TSP (traveling salesman) heuristic.
The Bigger Kernel (Aᵏx) Runs Faster than the Simpler One (Ax)
Speedups on Intel Clovertown (8 core)
Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂
Standard GMRES:
    for i = 1 to k
        w = A · v(i−1)            … SpMV
        MGS(w, v(0), …, v(i−1))   … modified Gram–Schmidt
        update v(i), H
    endfor
    solve the LSQ problem with H
Communication-avoiding GMRES:
    W = [v, Av, A²v, …, Aᵏv]      … matrix powers kernel
    [Q, R] = TSQR(W)              … “Tall Skinny QR”
    build H from R
    solve the LSQ problem with H
Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
• Oops — W comes from the power method, so precision is lost!
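Why the monomial basis loses precision: the columns of W are power-method iterates, which collapse toward the dominant eigenvector and become nearly linearly dependent. A toy illustration (the 2×2 matrix and starting vector are arbitrary choices):

```python
import math

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means parallel."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

# Diagonal A with dominant eigenvalue 2: A^k v = [2^k, 1] aligns with [1, 0].
A = [[2.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
basis = [v]
for _ in range(10):
    basis.append(matvec(A, basis[-1]))

# Successive monomial-basis vectors become almost parallel (nearly dependent):
assert cosine(basis[-2], basis[-1]) > 0.999
assert cosine(basis[0], basis[1]) < 0.95
```

The Newton-basis variant on the next slide avoids exactly this collapse.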
Matrix Powers Kernel (and TSQR) in GMRES
[Convergence plots: relative residual norm ||Ax − b|| (10⁻⁵ to 10⁰) vs. iteration count (0–1000) for Original GMRES, CA-GMRES (monomial basis), and CA-GMRES (Newton basis)]
Communication-Avoiding Krylov Method (GMRES)
Performance on 8 core Clovertown
CA-Krylov Methods Summary and Future Work
• Communication avoidance works
  – Provably optimal
  – Faster in practice
• Ongoing work on preconditioning
  – Handle “hard” matrices that partition poorly (high surface-to-volume ratio)
  – Idea: separate out dense rows (HSS matrices)
  – [Erin Carson, Nick Knight, Jim Demmel]
Lesson #3: Understand Numerics (or Work with Someone Who Does)
Don’t be afraid to change to a “different” right answer.
Lesson #4: Never Waste Fast Memory
Don’t get hung up on the “owner computes” rule.
Beyond Domain Decomposition: 2.5D Matrix Multiply
[Chart: 2.5D MM on BG/P (n = 65,536): percentage of machine peak vs. #nodes (256–2048) for 2.5D Broadcast-MM, 2.5D Cannon-MM, 2D MM (Cannon), and ScaLAPACK PDGEMM; the 2.5D variants show perfect strong scaling]
• Conventional “2D algorithms” use a P^(1/2) × P^(1/2) mesh and minimal memory
• New “2.5D algorithms” use a (P/c)^(1/2) × (P/c)^(1/2) × c mesh and c-fold memory
• Matmul sends c^(1/2) times fewer words — matching the lower bound
• Matmul sends c^(3/2) times fewer messages — matching the lower bound
Work by Edgar Solomonik and Jim Demmel
Surprises:
• Even matrix multiply had room for improvement
• Idea: make copies of the C matrix (as in the prior 3D algorithm, but not as many)
• The result is provably optimal in communication
Lesson: never waste fast memory. Can we generalize this for compiler writers?
Deconstructing 2.5D Matrix Multiply — Solomonik & Demmel
[Diagram: 3D iteration space with axes x, y, z]
• Tiling the iteration space
• 2D algorithm: never chop the k dimension
• 2.5D or 3D: assume + is associative; chopping k → replication of the C matrix
Matrix multiplication code has a 3D iteration space (i, j, k); each point in the space is a constant amount of computation (* and +):
    for i
      for j
        for k
          C[i,j] += A[i,k] * B[k,j]
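The chop-k transformation can be sketched directly from the loop nest above. A toy Python version (matrix contents are arbitrary; c plays the role of the replication factor): the k range is split into c pieces, each producing a partial copy of C, and the copies are reduced at the end, relying only on + being associative. Small integer values keep the floating-point sums exact, so the check is an exact equality.

```python
def matmul_partial(A, B, k_lo, k_hi):
    """Partial product of n x n matrices over a slice [k_lo, k_hi) of k."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(k_lo, k_hi))
             for j in range(n)] for i in range(n)]

def matmul_chop_k(A, B, c):
    """Chop k into c pieces (one replicated partial copy of C per piece),
    then reduce. Correctness relies only on + being associative."""
    n = len(A)
    step = n // c
    partials = [matmul_partial(A, B, g * step, n if g == c - 1 else (g + 1) * step)
                for g in range(c)]
    # Reduction over the c partial copies of C:
    return [[sum(P[i][j] for P in partials) for j in range(n)]
            for i in range(n)]

A = [[1.0 * (i + j) for j in range(4)] for i in range(4)]
B = [[1.0 * (i - j) for j in range(4)] for i in range(4)]
assert matmul_chop_k(A, B, 2) == matmul_partial(A, B, 0, 4)
```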
Lower Bound Idea on C = A·B — Irony, Toledo, Tiskin
[Diagram: 3D iteration space; a black box with side lengths x, y, z; “A shadow” and “C shadow” projections shown]
# cubes in the black box with side lengths x, y, z
= volume of the black box = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2)
(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949): # cubes in a 3D set = volume of the 3D set
≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
Lesson #5: Understand Theory
Lower bounds help identify optimizations.
Traditional (Naïve n²) N-body Algorithm (using a 1D decomposition)
• Given n particles, p processors, and memory of size M
• Each processor has n/p particles
• Algorithm: shift a copy of the particles to the left p times, calculating all pairwise forces
• Computation cost: n²/p
• Communication cost: O(p) messages, O(n) words
[Diagram: p processors in a row, each holding n/p particles]
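A toy Python version of the shift algorithm (the 1/r force kernel and the particle positions are arbitrary stand-ins, and the p "ranks" are simulated in one process): each rank keeps its resident particles and a travelling copy that is shifted left p times, so every rank eventually sees every particle.

```python
def forces_direct(pos):
    """All-pairs 1/r forces (toy kernel), computed on one processor."""
    n = len(pos)
    f = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                f[i] += 1.0 / (pos[i] - pos[j])
    return f

def forces_shift(pos, p):
    """1D decomposition: p ranks, each owning n/p particles; a copy of the
    particles is shifted left p times so every rank sees every particle."""
    n = len(pos)
    m = n // p
    owned = [pos[r * m:(r + 1) * m] for r in range(p)]   # resident particles
    moving = [list(chunk) for chunk in owned]            # travelling copies
    f = [[0.0] * m for _ in range(p)]
    for _ in range(p):
        for r in range(p):                               # each rank computes
            for i, xi in enumerate(owned[r]):
                for xj in moving[r]:
                    if xi != xj:                         # positions are distinct
                        f[r][i] += 1.0 / (xi - xj)
        moving = moving[1:] + moving[:1]                 # shift copies left
    return [fi for rank in f for fi in rank]

pos = [1.0 + 0.5 * i for i in range(8)]
diffs = [abs(a - b) for a, b in zip(forces_shift(pos, 4), forces_direct(pos))]
assert max(diffs) < 1e-9  # same forces, up to summation-order rounding
```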
Communication-Avoiding Version (using a “1.5D” decomposition) — Solomonik, Yelick
• Divide p into c groups; replicate particles within each group
  – The first row is responsible for updating all by orange, the second all by green, …
• Algorithm: shift a copy of n/(p·c) particles to the left
  – Combine with previous data before passing further (log steps)
• Reduce across c to produce the final value for each particle
• Total computation: O(n²/p)
• Total communication: O(log(p/c) + log c) messages, O(n·(c/p + 1/c)) words
[Diagram: the particle array replicated across c rows of p/c processors each]
Limit: c ≤ p^(1/2)
In theory there is no difference between theory and practice, but in practice there is.
-- Jan L. A. van de Snepscheut, computer scientist
   (also attributed to Yogi Berra, baseball player and manager)
Performance Results
• Over 2× speedup at 24K cores relative to the 1D algorithm
• In general, “.5D” replication is best for small problems on big machines
Driscoll, Georganas, Koanantakool, Solomonik, Yelick (unpublished)
[Chart: strong scaling for All-Pairs (196,608 particles): GFlops/core vs. number of cores (96–24,576) for c = 1, 2, 4, 8, 16 and ideal]
Have We Seen this Idea Before?
• These algorithms also maximize parallelism beyond “domain decomposition”
  – SIMD machine days
• Automation depends on an associative operator for the updates (e.g., M. Wolfe)
• Also used for “synchronization avoidance” in a particle-in-cell code (Madduri, Su, Oliker, Yelick)
  – Replicate-and-reduce optimization given p copies
  – Useful on vector machines / GPUs
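The replicate-and-reduce idea can be sketched for a toy charge-deposition loop (the cell indices, charges, and worker partitioning are made up): each of p workers scatters into a private copy of the grid, and the copies are summed afterwards, so no per-cell synchronization (atomics or locks) is needed.

```python
def deposit_shared(particles, ncells):
    """Direct deposition into one shared grid: concurrent writers would
    need atomics or locks on each cell."""
    grid = [0.0] * ncells
    for cell, charge in particles:
        grid[cell] += charge
    return grid

def deposit_replicated(particles, ncells, p):
    """Replicate-and-reduce: each of p workers deposits into a private
    copy, then the p copies are summed; no synchronization on the grid."""
    copies = [[0.0] * ncells for _ in range(p)]
    for w in range(p):
        for cell, charge in particles[w::p]:   # worker w's share of particles
            copies[w][cell] += charge
    return [sum(col) for col in zip(*copies)]  # the reduction step

particles = [(i % 5, 1.0) for i in range(20)]
assert deposit_replicated(particles, 5, 4) == deposit_shared(particles, 5)
```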
Lesson #6: Aggregate Communication
Pack messages to better amortize per-message communication costs.
Lesson #7: Overlap and Pipeline Communication
Sometimes this runs contrary to Lesson #6 on aggregation.
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea!
Three approaches:
• Chunk (all rows with the same destination):
  – Wait for 2nd-dimension FFTs to finish
  – Minimizes # messages
• Slab (all rows in a single plane with the same destination):
  – Wait for a chunk of rows destined for one proc to finish
  – Overlaps with computation
• Pencil (one row):
  – Send each row as it completes
  – Maximizes overlap and matches the natural layout
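The tension between aggregation (Lesson #6) and overlap (Lesson #7) can be modeled with a toy cost function; all the constants below are made-up illustrations, not measurements. With a small per-message overhead (as in one-sided UPC communication), many overlapped pencils win; with a large per-message overhead (as in MPI), a few aggregated chunks win.

```python
def blocking_time(n_msgs, msg_words, overhead, word_time, compute_per_msg):
    """No overlap: computation and communication strictly alternate."""
    return n_msgs * (overhead + msg_words * word_time + compute_per_msg)

def pipelined_time(n_msgs, msg_words, overhead, word_time, compute_per_msg):
    """Full overlap: per-message overhead is still paid, but the transfer
    hides behind computation (or vice versa)."""
    transfer = n_msgs * msg_words * word_time
    compute = n_msgs * compute_per_msg
    return n_msgs * overhead + max(transfer, compute)

# Same total work and data either way: 256 pencils of 64 words,
# or 4 chunks of 4096 words; word_time = 1e-8 s.
pencil_low = pipelined_time(256, 64, 1e-7, 1e-8, 2e-6)     # cheap messages
chunk_low = blocking_time(4, 4096, 1e-7, 1e-8, 1.28e-4)
pencil_high = pipelined_time(256, 64, 1e-5, 1e-8, 2e-6)    # costly messages
chunk_high = blocking_time(4, 4096, 1e-5, 1e-8, 1.28e-4)

assert pencil_low < chunk_low    # low overhead: overlap wins
assert chunk_high < pencil_high  # high overhead: aggregation wins
```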
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
[Chart: best MFlops per thread for all NAS FT benchmark versions — Chunk (NAS FT with FFTW), Best MPI (always slabs), Best UPC (always pencils) — on Myrinet 64, InfiniBand 256, Elan3 256/512, Elan4 256/512; peak ~0.5 Tflop/s]
Lesson #8: Avoid Synchronization, a Particular Form of Communication
Use one-sided communication.
Avoiding Synchronization in Communication
• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address to put the data
  – Wildly popular in HPC, but cumbersome in some applications
  – Couples data transfer with synchronization
• Using a global address space decouples synchronization
  – Pay for what you need!
  – Note: global addressing ≠ cache-coherent shared memory
[Diagram: a two-sided message carries a message id and data payload, matched to an address by the host CPU; a one-sided put message carries the address and data payload and is deposited into memory by the network interface]
Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group
Performance Advantage of One-Sided Communication
[Chart: 8-byte roundtrip latency (µsec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed]
• The put/get operations in PGAS languages (remote read/write) are one-sided (no required interaction from the remote processor)
• This is faster for pure data transfers than two-sided send/receive
[Chart: flood bandwidth for 4 KB messages as a percent of HW peak, MPI vs. GASNet, on the same six platforms]
UPC at Scale on the MILC (QCD) App
[Chart: MILC performance, sites/second vs. number of cores (512–32,768), for UPC Opt, MPI, and UPC Naïve]
Shan, Austin, Wright, Strohmaier, Shalf, Yelick (unpublished)
Lesson #9: “Strength-Reduce” Synchronization
Replace global barriers with subset barriers, point-to-point synchronization, and event-driven execution.
49
Avoid Synchronization from Applications
Cholesky 4 x 4
QR 4 x 4
Computations as DAGs View parallel executions as the directed acyclic graph of the computation
Slide source: Jack Dongarra
Event Driven LU in UPC
• Assignment of work is static; schedule is dynamic
• Ordering needs to be imposed on the schedule
  – Critical path operation: panel factorization
• General issue: dynamic scheduling in partitioned memory
  – Can deadlock in memory allocation
  – “Memory-constrained” lookahead
[Figure: DAG of LU tasks; some edges omitted]
DAG Scheduling Outperforms Bulk-Synchronous Style
[Chart: UPC vs. ScaLAPACK LU factorization, GFlops on 2x4 and 4x4 processor grids; UPC outperforms ScaLAPACK on both]
UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
– New problem in partitioned memory: allocator deadlock
– Can run out of memory locally due to unlucky execution order
PLASMA on shared memory; UPC on partitioned memory
PLASMA by Dongarra et al; UPC LU joint with Parry Husbands
Another reason to avoid synchronization
• Processors do not run at the same speed
  – They never did, due to caches
  – Power and temperature management makes this worse
Lesson #10: Combine Techniques
These are not entirely orthogonal, and there are many interesting trade-offs.
Communication Avoidance (2.5D) and Overlap (with UPC) in Matrix Multiply
[Chart: SUMMA matrix multiply on Hopper (n=32,768), percentage of machine peak vs. number of cores (1,536; 6,144; 24,576); series: 2.5D-overlap, 2.5D, 2D-overlap, 2D]
Georganas, González-Domínguez, Solomonik, Zheng, Touriño, Yelick (to appear at SC12)
Lessons in Communication Avoidance
1. Compress Data Structures
2. Target Higher-Level Loops
3. Understand theory / numerics
4. Replicate data
5. Understand theory / lower bounds
6. Aggregate communication
7. Overlap communication
8. Use one-sided communication
9. Synchronization strength reduction
10. Combine the techniques
Challenges in Exascale Computing
There are many exascale challenges:
• Scaling
• Synchronization
• Dynamic system behavior
• Irregular algorithms
• Resilience
But don’t forget what’s really important
Communication Hurts!