TRANSCRIPT
High-Performance GPU Programming for Deep Learning
7 April 2016 Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
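As a concrete reference point (a NumPy sketch, not the actual Maxwell kernels), GEMM computes C[i, j] = Σₖ A[i, k]·B[k, j]:

```python
import numpy as np

def gemm(A, B):
    """Reference GEMM: C[i, j] = sum_k A[i, k] * B[k, j]."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
C = gemm(A, B)
```

Every fast implementation reorders and blocks these three loops; the math stays the same.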
GEMM: Memory Load
[Figure: memory-load patterns, single tile and batched GEMM; thread and memory-load layout for outer product contiguous vs. outer product strided]
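The load pattern the figure illustrates comes from viewing GEMM as a sum of rank-1 outer products over K: each step reads one column of A and one row of B and accumulates their outer product into C. A minimal NumPy sketch:

```python
import numpy as np

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)

# GEMM as accumulated rank-1 outer products over the K dimension.
# Each iteration touches one column of A and one row of B -- the
# access pattern that the tile loads are organized around.
C = np.zeros((8, 4))
for k in range(16):
    C += np.outer(A[:, k], B[k, :])
```

Whether those column/row reads are contiguous or strided in memory is exactly what distinguishes the two layouts on the slide.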
GEMM: Tile sizes
[Figure: thread and shared-memory load layout for three tile configurations: GEMM tile 32×32, GEMM tile 32×64, batched GEMM tiles 32×32]
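Tiling can be sketched in NumPy as blocked matrix multiply (the tile size standing in for one thread block's shared-memory working set; the real kernels additionally stage the tiles through shared memory and registers):

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """Blocked GEMM: each (ti, tj) output tile is accumulated from
    tile-sized slices of A and B, mimicking the working set one
    thread block keeps in shared memory."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for ti in range(0, M, tile):
        for tj in range(0, N, tile):
            for tk in range(0, K, tile):
                C[ti:ti+tile, tj:tj+tile] += (
                    A[ti:ti+tile, tk:tk+tile] @ B[tk:tk+tile, tj:tj+tile])
    return C

A = np.random.rand(64, 96)
B = np.random.rand(96, 64)
```

A smaller tile like 32×32 wastes less work when N (the minibatch) is small, which is why it wins at small batch sizes in the charts that follow.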
hGEMM Results - NN
[Chart: N×3072×3072 NN op, GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32×32 tile vs. cuBLAS 128×64 tile]
hGEMM Results - TN
[Chart: N×3072×3072 TN op, GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32×32 tile vs. cuBLAS 128×64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until faster algo can be implemented
• Often faster for a single image, or for the first layer where C (the number of input channels) is small
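For reference, direct convolution computes each output pixel as a reduction over input channels and the filter window; a naive NumPy sketch (the real kernels never materialize these loops, but the indexing is the same):

```python
import numpy as np

def direct_conv2d(x, w, stride=1):
    """Naive direct convolution (cross-correlation, no padding).
    x: input (C, H, W); w: filters (K, C, R, S); output (K, P, Q)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    y = np.zeros((K, P, Q))
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                patch = x[:, p*stride:p*stride+R, q*stride:q*stride+S]
                y[k, p, q] = np.sum(patch * w[k])
    return y
```

Note how striding falls out naturally here, which is one reason direct convolution stays relevant where transform-based methods do not apply.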
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update-specific optimizations
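The "bprop as upside-down fprop" point can be illustrated in NumPy: the data gradient of a stride-1 convolution is itself a forward convolution, over the output gradient padded by R−1/S−1, with the filters rotated 180° and their C/K axes swapped. A sketch (not the actual kernel, which fuses this into the same GEMM machinery):

```python
import numpy as np

def conv2d(x, w):
    """Valid cross-correlation. x: (C, H, W); w: (K, C, R, S)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    y = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for p in range(y.shape[1]):
            for q in range(y.shape[2]):
                y[k, p, q] = np.sum(x[:, p:p+R, q:q+S] * w[k])
    return y

def conv2d_bprop(dy, w):
    """Data gradient as an 'upside down' fprop: pad dy, rotate the
    filters 180 degrees, swap the C and K filter axes, convolve."""
    K, C, R, S = w.shape
    dy_pad = np.pad(dy, ((0, 0), (R - 1, R - 1), (S - 1, S - 1)))
    w_flip = w[:, :, ::-1, ::-1].transpose(1, 0, 2, 3)  # (C, K, R, S)
    return conv2d(dy_pad, w_flip)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 6))
w = rng.standard_normal((3, 2, 3, 3))
dy = rng.standard_normal((3, 4, 4))
dx = conv2d_bprop(dy, w)
```

This is why one well-tuned fprop kernel covers bprop nearly for free.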
Winograd: input transform
[Figure: input feature map, 4×4 tiles with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
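A NumPy sketch of the F(2×2, 3×3) input transform (coefficients as published by Lavin and Gray): the 2D transform is the nested product BᵀdB applied to each overlapping 4×4 tile. Since Bᵀ contains only 0 and ±1, the simplified form is pure adds and subtracts:

```python
import numpy as np

# F(2x2, 3x3) input-transform matrix (Lavin & Gray).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)

d = np.arange(16, dtype=float).reshape(4, 4)  # one 4x4 input tile

# Nested product of 1D transforms: rows first, then columns.
V = Bt @ d @ Bt.T
```

In the real kernel each of these 16 outputs is computed once per tile and fed to the batched GEMM stage.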
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
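The filter transform has the same nested-product shape, with the G coefficients instead of B, applied once per (output channel, input channel) 3×3 filter. A sketch with the standard F(2×2, 3×3) matrix:

```python
import numpy as np

# F(2x2, 3x3) filter-transform matrix (Lavin & Gray).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

g = np.random.rand(3, 3)   # one 3x3 filter
U = G @ g @ G.T            # 4x4 transformed filter
```

Because filters are fixed for a whole minibatch, this transform is amortized over every input tile.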
Winograd: batched GEMM
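After both transforms, the elementwise products at each of the 16 transform coordinates reduce over input channels, so each coordinate is an independent (K×C) @ (C×tiles) GEMM: a batch of 16 GEMMs. A sketch of the data layout (shapes here are small illustrative assumptions):

```python
import numpy as np

K, C, T = 8, 4, 6                 # output channels, input channels, tiles
U = np.random.rand(4, 4, K, C)    # transformed filters
V = np.random.rand(4, 4, C, T)    # transformed input tiles

# Batched GEMM over the 16 transform coordinates in one einsum:
#   M[i, j, k, t] = sum_c U[i, j, k, c] * V[i, j, c, t]
M = np.einsum('ijkc,ijct->ijkt', U, V)

# Equivalent explicit batch of 16 GEMMs.
M_loop = np.empty_like(M)
for i in range(4):
    for j in range(4):
        M_loop[i, j] = U[i, j] @ V[i, j]
```

This is where the earlier batched-GEMM machinery pays off: the per-coordinate matrices are small, so small tiles and high occupancy matter.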
Winograd: output transform
[Figure: output feature map]
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2×2 output tile
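Putting the three transforms together, here is an end-to-end F(2×2, 3×3) sketch on a single tile and filter, checked against direct cross-correlation (matrices per Lavin and Gray; the production kernels fuse and batch all of this):

```python
import numpy as np

# F(2x2, 3x3) transform matrices.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0, 0, 1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one 4x4 input tile
g = rng.standard_normal((3, 3))   # one 3x3 filter

# Transform in, multiply elementwise, transform back to pixel space.
Y = At @ ((G @ g @ G.T) * (Bt @ d @ Bt.T)) @ At.T   # 2x2 output tile

# Direct 3x3 cross-correlation over the same tile, for comparison.
Y_ref = np.array([[np.sum(d[p:p+3, q:q+3] * g) for q in range(2)]
                  for p in range(2)])
```

Four multiplies per output point drop to 16/4 = 4 transformed multiplies per 2×2 tile, versus 9 for direct convolution: the source of the algorithmic speedup in the charts below.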
Performance: VGG
[Chart: VGG fp32, totals by operation; algorithmic speedup (0 to 2×) vs. batch size (1 to 64); Winograd fp32 fprop/bprop/update vs. cuDNN fp32 fprop/bprop/update]
Performance: Alexnet convolutional layers
[Chart: Alexnet totals; algorithmic speedup (0 to 2×) vs. batch size (4 to 128); Nervana fp16/fp32 vs. cuBLAS fp16/fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
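The idea in NumPy terms (a sketch: in the real kernels the epilogue runs on the accumulator registers before the store, so the extra ops cost no additional memory traffic):

```python
import numpy as np

def gemm_compound(A, B, C, alpha=1.0, beta=0.0, bias=None, relu=False):
    """GEMM with a fused epilogue: alpha/beta scaling, bias add, and
    ReLU applied in the same pass as the multiply, rather than as
    separate memory-bound kernels."""
    out = alpha * (A @ B) + beta * C
    if bias is not None:
        out += bias                # broadcast across rows
    if relu:
        out = np.maximum(out, 0.0)
    return out
```

Run as separate kernels, each of these elementwise steps would re-read and re-write the whole output matrix; compounded, they are effectively free.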
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor