the spatter benchmark · 2019-03-03 · gather kernel, linear access impact of access sparsity 0 10...
TRANSCRIPT
![Page 1: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/1.jpg)
The Spatter Benchmark Or: Benchmarking and Modeling Sparse Memory Accesses
for Heterogeneous Systems
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc
![Page 2: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/2.jpg)
Purpose• Dense memory access is well understood, but it is
difficult to predict how a memory system will respond in irregular scenarios
• Indirection, poor spatial and temporal locality
• Spatter allows us to view performance changes across architectures, so that we can better understand their differences
• You can’t understand what you can’t measure
!2
![Page 3: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/3.jpg)
Spatter• Memory benchmark
based on a scatter/gatherkernels • Scatter Y[j[:]] = X[:]
• Gather Y[:] = X[i[:]]
• SG Y[j[:]] = X[i[:]]
• Designed to model sparse data movement in applications like SuperLU and kernels like SpGEMM
• Includes effects of indirection and random or sparse access
!3
Gather
SRC
idx152 8
DEST
![Page 4: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/4.jpg)
Configuration
• Backend - OpenMP, OpenCL, or CUDA
• Work per thread (work item)
• CUDA (or OpenCL) block size
• Buffer sizes, cache-ability, access stride
!4
![Page 5: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/5.jpg)
Examples
!5
![Page 6: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/6.jpg)
Examples - CSR SpMV
• Gather elements of x, then doa dot product with data in A.
A xy
=
=
for (i in range(nrows)): indices ←⃪ row[i] : row[i+1] gather(tmp, x, col[indices]) y[i] = dot_prod(val[indices], tmp)
!6
![Page 7: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/7.jpg)
Examples - CSC SpMV
• Scale some a column of A by the value in x, then scatter-accumulate into y.
for (i in range(ncols)): indices ←⃪ col[i] : col[i+1] tmp ←⃪ vector_scale(val[indices], x[i]) scatter_accum(y, row[indices], tmp)
A xy
=
=
!7
![Page 8: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/8.jpg)
Examples - SpGEMM• Scatter-accumulate
columns of A corresponding to non-zero entries in a column of B into a dense SPA buffer. Gather SPA into C.
A BC
=
=
Algorithm from Buluç and Gilbert: Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments https://doi.org/10.1137/110848244
for (j in range(ncols) : SPA = 0 //dense accumulation buffer for non-zero B(k,j) : scatter_accum(SPA, A(:,k)*B(k,j)) gather(C.val, SPA) gather(C.row, which(SPA)) C.col[j+1] = C.col[j] + nnz(SPA)
SPA:
!8
![Page 9: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/9.jpg)
Examples - Vectorization• Some forms of vectorization may naturally lead to Gather/
Scatter operations
for (j in range(N)): for (i in range(4)): out[j] += data[i,j]
Column-Major
for (j = 0; j < N; j+=8): for (i in range(4)): gather_accum_stride(temp,j+i, 8, 4) //gather 8 elements, //gap size 4 out[j:j+8] = temp
!9
![Page 10: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/10.jpg)
Example - SuperLU
• SuperLU spends a large portion its runtime on just scattering data
0
20
40
60
80
100
ND24K
BBMATH2O
Norm
alize
d ex
ecut
ion
time
SCATTER DGEMM REST
������� ������ ������� ��������
Chart credits: Piyush Sao !10
![Page 11: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/11.jpg)
Benchmark Output
!11
![Page 12: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/12.jpg)
Performance Exploration
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64Work per threadEf
f. Ba
ndw
idth
(% o
f Bab
elSt
ream
)
Tesla P100, CUDA BackendGather Bandwidth
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64
BDW, OpenMP Backend
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64
Power8, OpenMP Backend
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64
Tesla P100, CUDA BackendScatter Bandwidth
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64
BDW, OpenMP Backend
0
10
20
30
40
50
60
70
1 2 4 8 16 32 64
Power8, OpenMP Backend
Sparsity1
2
4
8
16
32
64
128
!12
Uniform Stride
![Page 13: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/13.jpg)
Uniform Stride Access
0
10
20
30
40
50
60
70
80
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendBDW−OMP
GV100−CUDA
K40c−CUDA
KNL−OMP
P100−CUDA
Power8−OMP
SNB−OMP
ThunderX2−OMP
Titan Xp−CUDA
GATHER kernel, Linear AccessImpact of Access Sparsity
0
10
20
30
40
50
60
70
80
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendBDW−OMP
GV100−CUDA
K40c−CUDA
KNL−OMP
P100−CUDA
Power8−OMP
SNB−OMP
ThunderX2−OMP
Titan Xp−CUDA
SCATTER kernel, Linear AccessImpact of Access SparsityGather Scatter
CPU
GPU
!13
![Page 14: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/14.jpg)
Random AccessGather, Uniform
0
10
20
30
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendK40c−CUDA
P100−CUDA
Titan Xp−CUDA
GATHER kernel, Random AccessImpact of Access Sparsity
0
10
20
30
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendBDW−OMP
KNL−OMP
Power8−OMP
SNB−OMP
ThunderX2−OMP
GATHER kernel, Random AccessImpact of Access Sparsity
0
10
20
30
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendK40c−CUDA
P100−CUDA
Titan Xp−CUDA
GATHER kernel, Random AccessImpact of Access Sparsity
0
10
20
30
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparsity
Effe
ctive
Ban
dwid
th (%
of B
abel
Stre
am)
Device−BackendBDW−OMP
KNL−OMP
Power8−OMP
SNB−OMP
ThunderX2−OMP
GATHER kernel, Random AccessImpact of Access Sparsity
!14
![Page 15: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/15.jpg)
Energy Efficiency
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparisty
Effic
ienc
y (B
ytes
/Jou
le)
DeviceBDW−OMP
GV100−CUDA
K40c−CUDA
KNL−OMP
P100−CUDA
Power8−OMP
SNB−OMP
ThunderX2−OMP
Titan Xp−CUDA
GATHER kernel, Linear AccessImpact of Access Sparsity
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 1281 2 4 8 16 32 64 128Sparisty
Effic
ienc
y (B
ytes
/Jou
le)
DeviceBDW−OMP
GV100−CUDA
K40c−CUDA
KNL−OMP
P100−CUDA
Power8−OMP
SNB−OMP
ThunderX2−OMP
Titan Xp−CUDA
SCATTER kernel, Linear AccessImpact of Access Sparsity
Uniform Stride
Gather Scatter
CPU
GPU
!15
![Page 16: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/16.jpg)
What’s Next?• Partner with industry to run on upcoming systems
• Evaluation of slightly more complex synthetic traces
• Mostly stride-1
• Write collisions
• Gather/Scatter traces from real (DOE) mini-apps
• Measure impact of vector length (SVE and AVX) on generated code (and therefore cache performance)
• CILK backend for EMU, FPGA-specific OpenCL Backend
• More general kernels, with accumulation and a length buffer
• Simplify to present a STREAM-like result
!16
![Page 17: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/17.jpg)
More Info
!17
• Spatter.io
• Documentation
• Guide to easily plot your GPU against ours
• ArXiv Pre-print
• Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns
• https://arxiv.org/abs/1811.03743
• Code
• https://github.com/hpcgarage/spatter
![Page 18: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/18.jpg)
Acknowledgements
This material is based upon work supported by the National Science Foundation under Award #1710371.
![Page 19: The Spatter Benchmark · 2019-03-03 · GATHER kernel, Linear Access Impact of Access Sparsity 0 10 20 30 40 50 60 70 80 1 2 4 8 16 32 64 128 Sparsity Effective Bandwidth (% of BabelStream)](https://reader033.vdocument.in/reader033/viewer/2022060510/5f2702431474af51367f3c49/html5/thumbnails/19.jpg)
The Spatter Benchmark (spatter.io)
Or: Benchmarking and Modeling Sparse Memory Accesses for Heterogeneous Systems
Patrick Lavin, Jeffery Young, Jason Riedy, Rich Vuduc