TRANSCRIPT
High-Performance GPU Programming for Deep Learning
7 April 2016 Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
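As a concrete reference point (a NumPy sketch, not the actual Maxwell kernels), GEMM computes C[i, j] = Σₖ A[i, k]·B[k, j]:

```python
import numpy as np

def gemm(A, B):
    """Reference GEMM: C[i, j] = sum_k A[i, k] * B[k, j]."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)
C = gemm(A, B)
```

Every fast implementation reorders and blocks these three loops; the math stays the same.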
GEMM: Memory Load
[Figure: memory-load patterns, single tile and batched GEMM; thread and memory-load layout for outer product contiguous vs. outer product strided]
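The load pattern the figure illustrates comes from viewing GEMM as a sum of rank-1 outer products over K: each step reads one column of A and one row of B and accumulates their outer product into C. A minimal NumPy sketch:

```python
import numpy as np

A = np.random.rand(8, 16)
B = np.random.rand(16, 4)

# GEMM as accumulated rank-1 outer products over the K dimension.
# Each iteration touches one column of A and one row of B -- the
# access pattern that the tile loads are organized around.
C = np.zeros((8, 4))
for k in range(16):
    C += np.outer(A[:, k], B[k, :])
```

Whether those column/row reads are contiguous or strided in memory is exactly what distinguishes the two layouts on the slide.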
GEMM: Tile sizes
[Figure: thread and shared-memory load layout for three tile configurations: GEMM tile 32×32, GEMM tile 32×64, batched GEMM tiles 32×32]
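Tiling can be sketched in NumPy as blocked matrix multiply (the tile size standing in for one thread block's shared-memory working set; the real kernels additionally stage the tiles through shared memory and registers):

```python
import numpy as np

def tiled_gemm(A, B, tile=32):
    """Blocked GEMM: each (ti, tj) output tile is accumulated from
    tile-sized slices of A and B, mimicking the working set one
    thread block keeps in shared memory."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for ti in range(0, M, tile):
        for tj in range(0, N, tile):
            for tk in range(0, K, tile):
                C[ti:ti+tile, tj:tj+tile] += (
                    A[ti:ti+tile, tk:tk+tile] @ B[tk:tk+tile, tj:tj+tile])
    return C

A = np.random.rand(64, 96)
B = np.random.rand(96, 64)
```

A smaller tile like 32×32 wastes less work when N (the minibatch) is small, which is why it wins at small batch sizes in the charts that follow.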
hGEMM Results - NN
[Chart: N×3072×3072 NN op, GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32×32 tile vs. cuBLAS 128×64 tile]
hGEMM Results - TN
[Chart: N×3072×3072 TN op, GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); Nervana 32×32 tile vs. cuBLAS 128×64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until faster algo can be implemented
• Often faster for a single image, or for the first layer where C (the number of input channels) is small
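For reference, direct convolution computes each output pixel as a reduction over input channels and the filter window; a naive NumPy sketch (the real kernels never materialize these loops, but the indexing is the same):

```python
import numpy as np

def direct_conv2d(x, w, stride=1):
    """Naive direct convolution (cross-correlation, no padding).
    x: input (C, H, W); w: filters (K, C, R, S); output (K, P, Q)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    y = np.zeros((K, P, Q))
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                patch = x[:, p*stride:p*stride+R, q*stride:q*stride+S]
                y[k, p, q] = np.sum(patch * w[k])
    return y
```

Note how striding falls out naturally here, which is one reason direct convolution stays relevant where transform-based methods do not apply.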
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update-specific optimizations
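The "bprop as upside-down fprop" point can be illustrated in NumPy: the data gradient of a stride-1 convolution is itself a forward convolution, over the output gradient padded by R−1/S−1, with the filters rotated 180° and their C/K axes swapped. A sketch (not the actual kernel, which fuses this into the same GEMM machinery):

```python
import numpy as np

def conv2d(x, w):
    """Valid cross-correlation. x: (C, H, W); w: (K, C, R, S)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    y = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for p in range(y.shape[1]):
            for q in range(y.shape[2]):
                y[k, p, q] = np.sum(x[:, p:p+R, q:q+S] * w[k])
    return y

def conv2d_bprop(dy, w):
    """Data gradient as an 'upside down' fprop: pad dy, rotate the
    filters 180 degrees, swap the C and K filter axes, convolve."""
    K, C, R, S = w.shape
    dy_pad = np.pad(dy, ((0, 0), (R - 1, R - 1), (S - 1, S - 1)))
    w_flip = w[:, :, ::-1, ::-1].transpose(1, 0, 2, 3)  # (C, K, R, S)
    return conv2d(dy_pad, w_flip)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 6))
w = rng.standard_normal((3, 2, 3, 3))
dy = rng.standard_normal((3, 4, 4))
dx = conv2d_bprop(dy, w)
```

This is why one well-tuned fprop kernel covers bprop nearly for free.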
Winograd: input transform
[Figure: input feature map, 4×4 tiles with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
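A NumPy sketch of the F(2×2, 3×3) input transform (coefficients as published by Lavin and Gray): the 2D transform is the nested product BᵀdB applied to each overlapping 4×4 tile. Since Bᵀ contains only 0 and ±1, the simplified form is pure adds and subtracts:

```python
import numpy as np

# F(2x2, 3x3) input-transform matrix (Lavin & Gray).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)

d = np.arange(16, dtype=float).reshape(4, 4)  # one 4x4 input tile

# Nested product of 1D transforms: rows first, then columns.
V = Bt @ d @ Bt.T
```

In the real kernel each of these 16 outputs is computed once per tile and fed to the batched GEMM stage.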
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
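The filter transform has the same nested-product shape, with the G coefficients instead of B, applied once per (output channel, input channel) 3×3 filter. A sketch with the standard F(2×2, 3×3) matrix:

```python
import numpy as np

# F(2x2, 3x3) filter-transform matrix (Lavin & Gray).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

g = np.random.rand(3, 3)   # one 3x3 filter
U = G @ g @ G.T            # 4x4 transformed filter
```

Because filters are fixed for a whole minibatch, this transform is amortized over every input tile.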
Winograd: batched GEMM
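After both transforms, the elementwise products at each of the 16 transform coordinates reduce over input channels, so each coordinate is an independent (K×C) @ (C×tiles) GEMM: a batch of 16 GEMMs. A sketch of the data layout (shapes here are small illustrative assumptions):

```python
import numpy as np

K, C, T = 8, 4, 6                 # output channels, input channels, tiles
U = np.random.rand(4, 4, K, C)    # transformed filters
V = np.random.rand(4, 4, C, T)    # transformed input tiles

# Batched GEMM over the 16 transform coordinates in one einsum:
#   M[i, j, k, t] = sum_c U[i, j, k, c] * V[i, j, c, t]
M = np.einsum('ijkc,ijct->ijkt', U, V)

# Equivalent explicit batch of 16 GEMMs.
M_loop = np.empty_like(M)
for i in range(4):
    for j in range(4):
        M_loop[i, j] = U[i, j] @ V[i, j]
```

This is where the earlier batched-GEMM machinery pays off: the per-coordinate matrices are small, so small tiles and high occupancy matter.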
Winograd: output transform
[Figure: output feature map]
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2×2 output tile
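Putting the three transforms together, here is an end-to-end F(2×2, 3×3) sketch on a single tile and filter, checked against direct cross-correlation (matrices per Lavin and Gray; the production kernels fuse and batch all of this):

```python
import numpy as np

# F(2x2, 3x3) transform matrices.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0, 0, 1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one 4x4 input tile
g = rng.standard_normal((3, 3))   # one 3x3 filter

# Transform in, multiply elementwise, transform back to pixel space.
Y = At @ ((G @ g @ G.T) * (Bt @ d @ Bt.T)) @ At.T   # 2x2 output tile

# Direct 3x3 cross-correlation over the same tile, for comparison.
Y_ref = np.array([[np.sum(d[p:p+3, q:q+3] * g) for q in range(2)]
                  for p in range(2)])
```

Four multiplies per output point drop to 16/4 = 4 transformed multiplies per 2×2 tile, versus 9 for direct convolution: the source of the algorithmic speedup in the charts below.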
Performance: VGG
[Chart: VGG fp32, totals by operation; algorithmic speedup (0 to 2×) vs. batch size (1 to 64); Winograd fp32 fprop/bprop/update vs. cuDNN fp32 fprop/bprop/update]
Performance: Alexnet convolutional layers
[Chart: Alexnet totals; algorithmic speedup (0 to 2×) vs. batch size (4 to 128); Nervana fp16/fp32 vs. cuBLAS fp16/fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
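The idea in NumPy terms (a sketch: in the real kernels the epilogue runs on the accumulator registers before the store, so the extra ops cost no additional memory traffic):

```python
import numpy as np

def gemm_compound(A, B, C, alpha=1.0, beta=0.0, bias=None, relu=False):
    """GEMM with a fused epilogue: alpha/beta scaling, bias add, and
    ReLU applied in the same pass as the multiply, rather than as
    separate memory-bound kernels."""
    out = alpha * (A @ B) + beta * C
    if bias is not None:
        out += bias                # broadcast across rows
    if relu:
        out = np.maximum(out, 0.0)
    return out
```

Run as separate kernels, each of these elementwise steps would re-read and re-write the whole output matrix; compounded, they are effectively free.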
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor