spcl.inf.ethz.ch
@spcl_eth
Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Torsten Hoefler (with Tobias Grosser)
Evading various “ends” – the hardware view
row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
S0: *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
S1:     temp1 = *input_image_offset++;
S1:     temp2 = *kernel_offset++;
S1:     *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
S2: *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}
}
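The pointer arithmetic in the loop nest above hides a regular access pattern. As a rough sketch, the same computation can be written with explicit array subscripts, which is the affine form a polyhedral tool reasons about. The function name `convolve2d`, its parameter list, and the `int` element type are assumptions made for illustration; the slide shows only the loop body.

```c
#include <assert.h>

/* Hypothetical signature: the slide shows only the loop nest, so the
 * parameter list and the int element type are assumptions. */
void convolve2d(int NN, int KK, const int *input_image, const int *kernel,
                int *output_image, int dead_rows, int dead_cols,
                int normal_factor)
{
    /* Preconditions that keep every subscript inside an NN x NN image. */
    assert(KK >= 1 && NN >= KK && normal_factor != 0);
    assert(dead_rows >= 0 && dead_rows < KK);
    assert(dead_cols >= 0 && dead_cols < KK);

    for (int r = 0; r < NN - KK + 1; r++) {
        for (int c = 0; c < NN - KK + 1; c++) {
            /* Same element the pointer version reaches via output_image_offset. */
            int *out = &output_image[(dead_rows + r) * NN + (dead_cols + c)];
            *out = 0;                                        /* S0 */
            for (int i = 0; i < KK; i++)
                for (int j = 0; j < KK; j++)                 /* S1 */
                    *out += input_image[(r + i) * NN + (c + j)]
                          * kernel[i * KK + j];
            *out /= normal_factor;                           /* S2 */
        }
    }
}
```

Every subscript here is an affine function of the loop counters `r`, `c`, `i`, `j` and the parameters `NN`, `KK`, `dead_rows`, `dead_cols`.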
[Figure: Sequential Software (Fortran, C/C++) on one side, Parallel Hardware on the other: a Multi-Core CPU and a GPGPU Accelerator]
Design Goals
Automatic
“Regression Free” High Performance
Automatic accelerator mapping - How close can we get?
Non-Goal: Algorithmic Changes
Tool: Polyhedral Modeling
Program Code
for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);
Iteration Space
[Figure: triangular set of points in the (i, j) plane for N = 4, axes from 0 to 5, bounded by 0 ≤ i, i ≤ N = 4, 0 ≤ j, j ≤ i]
D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
(i, j) = (0,0) (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Tobias Grosser et al., Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Parallel Processing Letters, 2012
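To make the link between the loop nest and the set D concrete, here is a small sketch that enumerates the iteration domain in execution order and tests membership against the constraints. The names `Point`, `in_domain`, and `enumerate_domain` are made up for this illustration and are not part of Polly.

```c
#include <assert.h>
#include <stddef.h>

/* One statement instance S(i,j). Hypothetical helper type. */
typedef struct { int i, j; } Point;

/* Membership test for D = { (i,j) | 0 <= i <= N and 0 <= j <= i }. */
int in_domain(int i, int j, int N)
{
    return 0 <= i && i <= N && 0 <= j && j <= i;
}

/* Enumerate D in the order the loop nest executes S(i,j).
 * Writes at most cap points to out and returns how many were written. */
size_t enumerate_domain(int N, Point *out, size_t cap)
{
    assert(N >= 0);
    size_t count = 0;
    for (int i = 0; i <= N; i++) {
        for (int j = 0; j <= i; j++) {
            assert(count < cap);   /* caller must provide enough room */
            out[count].i = i;
            out[count].j = j;
            count++;
        }
    }
    return count;
}
```

For N = 4 the domain holds (N + 1)(N + 2) / 2 = 15 points, starting at (0,0) and ending at (4,4).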
Mapping Computation to Device
[Figure: Iteration Space (axes i, j) mapped onto Device Blocks & Threads; block indices 0 to 1 per dimension, thread indices 0 to 3 and 0 to 2]
BID = { (i, j) → (⌊i/4⌋ % 2, ⌊j/3⌋ % 2) }
TID = { (i, j) → (i % 4, j % 3) }
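The two maps can be evaluated directly. The sketch below reads the fractions in BID as integer (floor) division, which is an assumption about the slide's notation. `Id2`, `block_id`, and `thread_id` are illustrative names only.

```c
#include <assert.h>

/* A 2-D block or thread index. Hypothetical helper type. */
typedef struct { int x, y; } Id2;

/* BID = { (i,j) -> (floor(i/4) % 2, floor(j/3) % 2) }:
 * tiles of 4 x 3 iterations are distributed over a 2 x 2 grid of blocks. */
Id2 block_id(int i, int j)
{
    assert(i >= 0 && j >= 0);   /* C division truncates; equal to floor here */
    Id2 b = { (i / 4) % 2, (j / 3) % 2 };
    return b;
}

/* TID = { (i,j) -> (i % 4, j % 3) }:
 * position of the iteration inside its 4 x 3 tile. */
Id2 thread_id(int i, int j)
{
    assert(i >= 0 && j >= 0);
    Id2 t = { i % 4, j % 3 };
    return t;
}
```

Under this reading, iteration (5, 4) lands in block (1, 1) as thread (1, 1), and iteration (8, 0) wraps around to block (0, 0), so one block processes several tiles.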
Memory Hierarchy of a Heterogeneous System
Host-device data transfers
Host-device data transfers
Mapping onto fast memory
Mapping onto fast memory
Sven Verdoolaege et al., Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization, 2013
Profitability Heuristic
[Figure: filter pipeline from All Loop Nests to GPU; filter stages: Trivial, Unsuitable, Insufficient Compute; labels: static, dynamic, Modeling Execution]
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
From kernels to program – data transfers
void heat(int n, float A[n], float hot, float cold) {
  float B[n] = {0};
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}
Data Transfer – Per Kernel
[Figure: timeline between Host Memory and Device Memory; initialize() and setCenter() are each followed by a D → H transfer; each average() call is preceded by an H → D transfer and followed by a D → H transfer]
void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}
Data Transfer – Inter Kernel Caching
[Figure: timeline between Host Memory and Device Memory for initialize(), setCenter(), and the repeated average() calls; only a single H → D and a single D → H transfer are shown]
void heat(int n, float A[n], ...) {
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}
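The difference between per-kernel transfers and inter-kernel caching can be illustrated by counting copies in a toy model of heat(). This is a simplified sketch and not Polly-ACC's runtime: the `Residency` state, the helper names, and the assumption that initialize() and setCenter() only write A while each average() reads one array and writes the other are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical residency state of one array (not a Polly-ACC data structure). */
typedef struct {
    bool host_valid;    /* host copy is up to date */
    bool device_valid;  /* device copy is up to date */
} Residency;

typedef struct {
    int h2d;  /* host-to-device copies */
    int d2h;  /* device-to-host copies */
} TransferCount;

/* Per-kernel policy: copy all inputs in and all outputs out at every launch. */
static void launch_per_kernel(int n_in, int n_out, TransferCount *tc)
{
    tc->h2d += n_in;
    tc->d2h += n_out;
}

/* Caching policy: copy an input only if the device copy is stale. */
static void cached_read(Residency *a, TransferCount *tc)
{
    if (!a->device_valid) { tc->h2d++; a->device_valid = true; }
}

/* Caching policy: a kernel result stays on the device. */
static void cached_write(Residency *a)
{
    a->device_valid = true;
    a->host_valid = false;
}

/* Caching policy: the host copies back only when its copy is stale. */
static void host_read(Residency *a, TransferCount *tc)
{
    if (!a->host_valid) { tc->d2h++; a->host_valid = true; }
}

/* Transfers for heat() with T time steps, one copy pair per kernel. */
TransferCount heat_per_kernel(int T)
{
    assert(T >= 0);
    TransferCount tc = {0, 0};
    launch_per_kernel(0, 1, &tc);        /* initialize: writes A */
    launch_per_kernel(0, 1, &tc);        /* setCenter: writes part of A */
    for (int t = 0; t < T; t++) {
        launch_per_kernel(1, 1, &tc);    /* average(n, A, B) */
        launch_per_kernel(1, 1, &tc);    /* average(n, B, A) */
    }
    return tc;
}

/* Transfers for heat() with T time steps when arrays stay cached on the device. */
TransferCount heat_cached(int T)
{
    assert(T >= 0);
    TransferCount tc = {0, 0};
    Residency A = {true, false}, B = {true, false};
    cached_write(&A);                    /* initialize */
    cached_write(&A);                    /* setCenter */
    for (int t = 0; t < T; t++) {
        cached_read(&A, &tc); cached_write(&B);   /* average(n, A, B) */
        cached_read(&B, &tc); cached_write(&A);   /* average(n, B, A) */
    }
    host_read(&A, &tc);                  /* caller uses A after heat() returns */
    return tc;
}
```

In this model, T = 3 gives 6 host-to-device and 8 device-to-host copies under the per-kernel policy, and a single device-to-host copy with caching. The per-kernel cost grows linearly with T while the cached cost stays constant.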
Evaluation
Workstation: 10-core Sandy Bridge, NVIDIA Titan Black (Kepler)
Mobile: 4-core Haswell, NVIDIA GT730M (Kepler)
LLVM Nightly Test Suite
[Figure: bar chart with a logarithmic y-axis "# Compute Regions / Kernels" (1 to 10000); categories: SCoPs, 0-dim, 1-dim, 2-dim, 3-dim; series: No Heuristics, Heuristics]
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Some results: Polybench 3.2
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
arithmean: ~30x
geomean: ~6x
Xeon E5-2690 (10 cores, 0.5Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7Tflop)
Speedup over icc -O3
Compiles all of SPEC CPU 2006 – Example: LBM
[Figure: bar chart of Runtime (m:s), axis from 0:00 to 8:24, for Mobile and Workstation; series: icc, icc -openmp, clang, Polly-ACC; annotations: ~20%, ~4x]
Xeon E5-2690 (10 cores, 0.5Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7Tflop)
essentially my 4-core x86 laptop with the (free) GPU that’s in there
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Cactus ADM (SPEC 2006)
[Figure: two panels, Workstation and Mobile]
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Cactus ADM (SPEC 2006) - Data Transfer
[Figure: two panels, Workstation and Mobile]
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Polly-ACC
Automatic
“Regression Free” High Performance
http://spcl.inf.ethz.ch/Polly-ACC
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS’16
Brave new compiler world!?
Unfortunately not …
Limited to affine code regions
Maybe generalizes to control-restricted programs
No distributed anything!!
Good news:
Much of traditional HPC fits that model
Infrastructure is coming along
Bad news:
Modern data-driven HPC and Big Data fit less well
Need a programming model for distributed heterogeneous machines!
How do we program GPUs today?
[Figure: timeline of ld and st instructions on one device; legend: device, compute core, active thread, instruction latency]
CUDA
• over-subscribe hardware
• use spare parallel slack for latency hiding
MPI
• host controlled
• full device synchronization
…
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
Latency hiding at the cluster level?
[Figure: timelines of ld, st, and put instructions across devices; legend: device, compute core, active thread, instruction latency]
dCUDA (distributed CUDA)
• unified programming model for GPU clusters
• avoid unnecessary device synchronization to enable system wide latency hiding
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
Talk on Wednesday
Tobias Gysi, Jeremiah Baer, TH: “dCUDA: Hardware Supported Overlap of Computation and Communication”
Wednesday, Nov. 16th
4:00-4:30pm
Room 355-D