Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Stanford University
August 30, 2004
[Figure: C = A * B]
Motivation: Harness GPU Performance
[Figure: relative performance (peak FLOPS and memory bandwidth) of the P4 3.4GHz, 6800 Ultra, and X800 XT PE]
Streaming Computation on GPUs
GPUs accelerate streaming numerical algorithms:
- Data parallelism
- High ratio of arithmetic to data access
- Little data reuse
[Figure: input elements → kernel function (shader) → output elements]
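The streaming model above can be sketched in a few lines (illustrative Python with a hypothetical kernel body; real GPU kernels are shader programs, not Python functions):

```python
import numpy as np

# The streaming model in miniature: a kernel (shader) runs
# independently on every input element -- pure data parallelism,
# with no data reuse between invocations.
def kernel(x):
    # Hypothetical arithmetic-heavy kernel body.
    return 2.0 * x + 1.0

inputs = np.arange(8, dtype=np.float32)   # input stream
outputs = kernel(inputs)                  # one logical invocation per element
```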
Streaming Computation on GPUs
- Level 1 BLAS operations: Buck et al. [2004]
- Fluid solvers: Krüger & Westermann [2003], Bolz et al. [2003]
- Image processing: Apple Corp. [2004], McCormick et al. [2004]
- Segmentation: Sherbondy et al. [2003]
- Database operations: Govindaraju et al. [2004]
- Data clustering: Hall et al. [2004]
Dense Matrix Multiplication
[Figure: C = A * B]
- Abundant data parallelism
- Regular data access (no branching)
- High ratio of computation to data access
Dense Matrix Multiplication
Widely used computational kernel
Building block for LAPACK library
Matrix Multiplication on GPUs
Larsen & McAllister [2001]
Moravansky [2003]
Hall et al. [2003]
Limited analysis of performance
Overview
GPU Implementations
Results
Analysis: Why GPUs are slow
Ways to Make GPUs Better
CPU-Based Approaches
High-performance matrix multiplication algorithms are cache-aware.
Partition the computation into submatrix multiplications:
- Load input submatrices into cache
- Multiply submatrices
- Store the output submatrix to memory
[Figure: C = A * B, blocked into submatrices]
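The blocking strategy described above can be sketched in NumPy (a minimal illustration; the block size of 4 is arbitrary here, and real implementations tune it to the cache size):

```python
import numpy as np

def blocked_matmul(A, B, block=4):
    """Cache-aware C = A @ B: partition into submatrix products so each
    input submatrix is reused while it is resident in cache."""
    n = A.shape[0]                     # assume square, n divisible by block
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, n, block):
            # Accumulate one output submatrix of C.
            for k in range(0, n, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C
```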
Method 1: Column Packed (CP)
Larsen & McAllister [SC2001], Moravansky [2003]
[Figure: C = A * B]
- 4 elements (x, y, z, w) stored per texel
- Inner loop performs 4x4-matrix by 4-vector multiplications
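A CPU-side sketch of the CP layout's inner loop (illustrative NumPy, not shader code; the 4-wide slices stand in for RGBA texels):

```python
import numpy as np

def cp_matmul(A, B):
    """Column-packed sketch: each 'texel' holds 4 consecutive elements
    of a matrix column, so one fragment step is a 4x4-matrix by
    4-vector multiply-accumulate."""
    n = A.shape[0]                    # assume square, n divisible by 4
    C = np.zeros((n, n), dtype=A.dtype)
    for j in range(n):                # one column of C per fragment column
        for i in range(0, n, 4):      # one texel = 4 elements of column j
            acc = np.zeros(4, dtype=A.dtype)
            for k in range(0, n, 4):
                # Four column texels of A form a 4x4 block; one texel of B.
                acc += A[i:i+4, k:k+4] @ B[k:k+4, j]
            C[i:i+4, j] = acc
    return C
```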
Method 2: Submatrix Packed (SP)
Hall et al. [2003]
[Figure: C = A * B]
- 2x2 submatrix (x y / z w) stored per texel
- Inner loop performs 2x2 by 2x2 submatrix multiplications
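The SP layout can be sketched by actually packing 2x2 blocks into 4-component "texels" (illustrative NumPy; `pack_2x2` and `sp_matmul` are names invented for this sketch):

```python
import numpy as np

def pack_2x2(M):
    """Map each 2x2 block of M to one 4-component 'texel' (x y / z w)."""
    n = M.shape[0]                          # assume n even
    T = np.empty((n // 2, n // 2, 4), dtype=M.dtype)
    T[..., 0] = M[0::2, 0::2]               # x = top-left
    T[..., 1] = M[0::2, 1::2]               # y = top-right
    T[..., 2] = M[1::2, 0::2]               # z = bottom-left
    T[..., 3] = M[1::2, 1::2]               # w = bottom-right
    return T

def sp_matmul(TA, TB):
    """One 'fragment' per output texel: accumulate 2x2-by-2x2 block
    products, written out on texel components."""
    nt = TA.shape[0]
    TC = np.zeros_like(TA)
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                a, b = TA[i, k], TB[k, j]
                TC[i, j, 0] += a[0] * b[0] + a[1] * b[2]
                TC[i, j, 1] += a[0] * b[1] + a[1] * b[3]
                TC[i, j, 2] += a[2] * b[0] + a[3] * b[2]
                TC[i, j, 3] += a[2] * b[1] + a[3] * b[3]
    return TC
```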
Alternative Approaches Ineffective
- Varied the mapping into texture memory
- Altered rasterization order with geometry (a single quad proved most effective)
- Utilized multiple outputs
- Varied the amount of loop unrolling (column packed: unroll maximally; submatrix packed: unroll 128 times)
Performance Results
CPU: Pentium 4 3GHz, 512KB L2 cache; 12 GFLOPS peak compute; 44.1 GB/sec cache bandwidth; using the sgemm routine from the ATLAS package
GPUs: NVIDIA GeForce 5900 Ultra and GeForce 6800 Ultra; ATI Radeon 9800 XT and Radeon X800 XT PE (prerelease, 500MHz memory / 500MHz core clock)
Previous Generation GPUs
[Figure: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for the P4 3GHz, 5900 Ultra, and 9800 XT]
Current Generation GPUs
[Figure: multiplication of 1024x1024 matrices; GFLOPS and bandwidth (GB/sec) for the P4 3GHz, 6800 Ultra, and X800 XT PE]
Fragment Processor Data Paths
[Figure: data paths connecting the L2 cache, L1 texture cache, texture unit, fragment processor, and framebuffer]
GPU Microbenchmarks
[Figure: peak arithmetic rate (GFLOPS) for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
GPU Microbenchmarks
[Figure: observed bandwidth (GB/sec), cache bandwidth vs. sequential bandwidth, for the 5900 Ultra, 6800 Ultra, 9800 XT, and X800 XT PE]
Fragment Processor Data Paths
[Figure: the same data-path diagram, annotated: the path into the L1 texture cache is high bandwidth (sized for texture filtering), while the path from the texture unit into the fragment processor is low bandwidth (1 float/clock)]
With one 4-wide MAD per clock, the fragment processor consumes data at 8X the rate the texture unit provides it!
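One plausible accounting for the 8X figure (a back-of-the-envelope check under my own assumptions, not stated on the slides): a 4-wide MAD reads two 4-wide operands per clock when the accumulator stays in a register, against 1 float per clock from the texture path.

```python
# Back-of-the-envelope check of the 8X ratio. The operand counting is
# an assumption: d = a*b + c, with c kept in a register.
mad_input_floats_per_clock = 2 * 4   # two 4-wide operands per MAD
texture_floats_per_clock = 1         # texture path: 1 float/clock
ratio = mad_input_floats_per_clock / texture_floats_per_clock
print(ratio)  # 8.0
```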
Datapaths Designed for Shading
[Figure: the data-path diagram, annotated for shading workloads: texture filtering performs an 8-to-1 reduction in the amount of data, delivering 4 components per clock with 8-bit components, for a 2-to-1 ratio of compute to bandwidth]
Texture units filter (reduce) data; shaders also use interpolated values & constants.
Compute and Bandwidth Efficiency
[Figure: percentage of peak compute and bandwidth achieved by the 5900 Ultra, 6800 Ultra, 9800 XT, X800 XT PE, and P4 3GHz]
GPU algorithms are severely bandwidth limited!
Minimize Texture Fetches
Block in shader register file
Would need 8x8 submatrices to run at peak rates
Limited to 4x4 submatrices by available outputs
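The 8x8 requirement is consistent with a simple reuse argument (my own accounting, sketched against the earlier rates of one 4-wide MAD per clock and 1 float per clock from texture):

```python
# One k-step of a b x b register block fetches 2*b*b input floats and
# performs b*b*b scalar MADs, so reuse grows linearly with b.
def mads_per_fetched_float(b):
    return (b ** 3) / (2 * b ** 2)   # = b / 2

# Peak rate needs 4 scalar MADs per fetched float (one 4-wide MAD per
# clock against 1 float per clock from texture), which requires b = 8.
for b in (2, 4, 8):
    print(b, mads_per_fetched_float(b))
```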
Improvement 1: Widen Datapath
Fragment processor receives cached data more quickly
Expect performance to improve linearly with the increase in bandwidth
Need ~4X improvement to achieve peak performance
But L2 may no longer be able to fill L1
Improvement 2: Larger Scratch Space
- Requires a large number of registers
- Needs a large number of output values
- Reduces texture bandwidth requirements
- Performance increases linearly with the dimension of the submatrices
- Increases the amount of per-pixel state
- Storage increases as the square of the submatrix dimension
- Requires 16X the space of the SP method for peak performance
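The 16X figure follows directly from the square growth (a one-line check, using the block dimensions from the slides):

```python
# Per-pixel state grows as the square of the block dimension:
# 8x8 blocks (needed for peak) vs. the SP method's 2x2 blocks.
sp_dim, peak_dim = 2, 8
space_factor = (peak_dim ** 2) // (sp_dim ** 2)
print(space_factor)  # 16
```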
Summary
- GPU algorithms for matrix-matrix multiplication run inefficiently
- The best algorithms achieve below 20% of peak performance
- They saturate the data path between the texture and fragment processor units
- Cache-aware software blocking strategies do not improve performance: they cannot exploit data reuse, and the hardware limits algorithm efficiency
Summary
- Hardware changes are required to improve efficiency: widen the path between the texture unit and the register file, and allow shaders to output a larger number of values
- Improved efficiency would make GPUs a powerful platform for a broader class of numerical algorithms
Acknowledgements
Thanks to Ian Buck, Mike Houston, Sean Treichler, Nick Triantos, Steve Morein
Support from ATI, NVIDIA, DARPA, IBM, SONY
Rambus Stanford Graduate Fellowship
Stanford School of Engineering Fellowship
Questions?