Communication-Minimizing 2D Convolution in GPU Registers
Forrest N. Iandola, David Sheffield, Michael Anderson, P. Mangpo Phothilimthana, Kurt Keutzer
University of California, Berkeley
Forrest Iandola ([email protected])
Overview
• Convolution is a recurring computational pattern in a broad range of computer vision applications
• Memory communication is the bottleneck for convolution on modern GPUs
• How to minimize memory communication overhead in convolution:
  – Texture cache
  – Loop blocking
• Up to 4.5x speedup over existing GPU implementations from NVIDIA, OpenCV, and others
Why focus on convolution?
• Berkeley ParLab project identified 15 recurring computational patterns in computer vision
• Small filters (2x2 – 7x7)
• Feature extraction
• Sliding-window object detection
• If we want fast computer vision, we need fast convolution
[Figure: the 15 computer vision patterns, annotated with their use in the CVPR 2007–2011 object recognition track]
What limits the performance of convolution?
• Roofline model [1] divides a program’s execution time into two parts:
  – Computational cost (GFLOP/s)
  – Communication cost (GB/s): memory traffic, I/O, etc.
• No program can outperform the hardware bound on computation or communication
[1] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.
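The roofline bound can be written down in a few lines; a minimal Python sketch (not from the talk), assuming published GTX 680 specs of roughly 3090 single-precision GFLOP/s and 192 GB/s DRAM bandwidth purely for illustration:

```python
# Roofline sketch: attainable throughput is the lesser of the compute peak
# and (DRAM bandwidth x arithmetic intensity). The hardware numbers below
# are assumed GTX 680 specs, used only for illustration.
PEAK_GFLOPS = 3090.0   # single-precision peak, GFLOP/s (assumed)
PEAK_GBS = 192.0       # DRAM bandwidth, GB/s (assumed)

def roofline_gflops(flops_per_byte):
    """Upper bound on GFLOP/s for a kernel with the given arithmetic intensity."""
    return min(PEAK_GFLOPS, PEAK_GBS * flops_per_byte)

# Ridge point: kernels below ~16 flops/byte are memory-bounded on these
# numbers; kernels above it are computation-bounded.
RIDGE = PEAK_GFLOPS / PEAK_GBS
```

Small-filter convolution sits far to the left of that ridge point, which is why the rest of the talk treats it as memory-bounded.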
What limits the performance of convolution?
[Figure: roofline model of computational performance; attainable performance (slow to fast) vs. arithmetic intensity, split into a memory-bounded region and a computation-bounded region]
What limits the performance of convolution?
• Convolution on NVIDIA GPUs:
  – Communication between the GPU’s off-chip DRAM and on-chip caches is the bottleneck
  – This doesn’t include communication between the CPU and GPU, though that can also be an issue
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.
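A back-of-the-envelope count of flops per DRAM byte shows why small-filter convolution is memory-bounded; this Python sketch is my arithmetic, not the talk's, and assumes 4-byte pixels, one multiply-add per filter tap, and ideal reuse (each input loaded once, each output stored once):

```python
def conv_arithmetic_intensity(k, bytes_per_pixel=4):
    """Flops per DRAM byte for a k x k filter, assuming ideal data reuse:
    amortized, each output costs ~one input load plus one output store."""
    flops_per_output = 2 * k * k               # multiply + add per filter tap
    dram_bytes_per_output = 2 * bytes_per_pixel
    return flops_per_output / dram_bytes_per_output

# Even a 7x7 filter reaches only ~12 flops/byte, below a Kepler-class
# ridge point of ~16, so DRAM bandwidth is the limit for 2x2-7x7 filters.
```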
Exploiting the GPU Memory Architecture

[Diagram: memory hierarchy of an NVIDIA GTX680. CPU DRAM connects to the GPU at 8 GB/s. GPU global memory (DRAM) feeds an L2 cache; each GPU multiprocessor has a texture cache (129 Gtexels/s), an L1 cache / shared memory, and registers serving its threads; on-chip bandwidths shown: 893 GB/s and 123 GB/s.]

Optimization 1: Use the Texture Cache
Data Reuse with Loop Blocking
Typical Implementation: no data reuse at the register level
Forrest Iandola [email protected]
9 input pixels
1 output pixel
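The "typical implementation" can be sketched as a scalar loop in which every output pixel re-reads its whole input window; a minimal Python reference for illustration only, not the authors' CUDA kernel:

```python
def conv2d_naive(image, filt):
    """Valid-mode 2D filtering (correlation, as is common in vision).
    Each output pixel independently loads its full k x k neighborhood:
    9 input reads per output for a 3x3 filter, with no reuse."""
    h, w, k = len(image), len(image[0]), len(filt)
    out = [[0.0] * (w - k + 1) for _ in range(h - k + 1)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            acc = 0.0
            for fy in range(k):
                for fx in range(k):
                    acc += image[y + fy][x + fx] * filt[fy][fx]
            out[y][x] = acc
    return out
```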
Data Reuse with Loop Blocking
Our approach: reuse data by doing more work per thread
Optimization 2: Block the image in registers

[Figure: the typical implementation loads 9 input pixels per output pixel (no register-level reuse); the blocked version loads a 16-pixel input block to produce 4 output pixels, i.e. 4 inputs per output]
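The register-blocking idea can be sketched the same way: one "thread" produces a 2x2 tile of outputs from a single 4x4 input block, so a 3x3 filter needs 4 input loads per output instead of 9. A hypothetical Python sketch, where local variables stand in for registers and even output dimensions are assumed:

```python
def conv2d_blocked(image, filt):
    """Same result as an unblocked version, but each step loads one
    (k+1) x (k+1) input block once and emits a 2x2 output tile:
    16 loads for 4 outputs with a 3x3 filter, i.e. 4 inputs per output."""
    h, w, k = len(image), len(image[0]), len(filt)
    oh, ow = h - k + 1, w - k + 1   # output dims, assumed even for simplicity
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(0, oh, 2):
        for x in range(0, ow, 2):
            # the "registers": one input block shared by the 2x2 output tile
            block = [[image[y + i][x + j] for j in range(k + 1)]
                     for i in range(k + 1)]
            for dy in range(2):
                for dx in range(2):
                    acc = 0.0
                    for fy in range(k):
                        for fx in range(k):
                            acc += block[dy + fy][dx + fx] * filt[fy][fx]
                    out[y + dy][x + dx] = acc
    return out
```

In the real kernel the block lives in GPU registers, so the reuse saves off-chip traffic rather than Python list indexing.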
Comparison with Related Work

NVIDIA GTX680 (Kepler)

[Figure, built up over several slides: inverse roofline model comparing implementations on the GTX680; with the texture cache and blocking (ours), up to a 4.5x speedup over related work]
Are we done?
• Are we done optimizing memory communication? I think so. We achieved the memory bandwidth bound for small filters.
• Future work: optimize computation some more!
Conclusions
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.
• Up to 4.5x faster than existing GPU languages and libraries
• Download our code! https://github.com/forresti/convolution
  – Use/modify it for your language/library/application
Forrest Iandola [email protected]