Revolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems

Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx
Join the Conversation #OpenPOWERSummit


  • Revolutionizing the Datacenter

    Power-Efficient Machine Learning using FPGAs on POWER Systems

    Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx

    Join the Conversation #OpenPOWERSummit

  • Super Human

    Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

    [Chart: top-5 accuracy by year**; humans: ~95%***]

    * http://image-net.org/challenges/LSVRC/
    ** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
    *** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

    Page 2

  • Super Human

    Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

    CNNs far outperform non-AI methods

    CNNs deliver super-human accuracy

    [Chart: top-5 accuracy by year**; humans: ~95%***]

    * http://image-net.org/challenges/LSVRC/
    ** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
    *** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

    Page 3

  • CNNs Explained

    Page 4

  • The Computation

    Page 5

  • The Computation

    Page 6

  • Convolution

    Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights

    [Figure: 13x13x384 Input, 3x3x384 Kernel Weights, 13x13x256 Output]

    Page 7

  • Convolution

    Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

    [Figure: as on Page 7]

    Page 8

  • Convolution

    Continue along the row ...

    [Figure: as on Page 7]

    Page 9

  • Convolution

    ... before moving down to the next row

    [Figure: as on Page 7]

    Page 10

  • Convolution

    The first output feature map is complete

    [Figure: as on Page 7]

    Page 11

  • Convolution

    Move on to the next output feature map by switching weights, and repeat

    [Figure: as on Page 7]

    Page 12

  • Convolution

    Pattern repeats as before: same input volumes, different weights

    [Figure: as on Page 7]

    Page 13

  • Convolution

    Complete the second output feature map plane

    [Figure: as on Page 7]

    Page 14

  • Convolution

    Finally, after 256 weight sets have been used, the full 13x13x256 output volume is complete

    [Figure: as on Page 7]

    Page 15

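    The loop nest these slides animate, as a minimal numpy sketch (shapes taken from the figure; a one-pixel zero pad is assumed so the output stays 13x13):

```python
import numpy as np

H, W, C_IN, C_OUT, K = 13, 13, 384, 256, 3

x = np.zeros((H + 2, W + 2, C_IN))              # input with 1-pixel zero pad
x[1:-1, 1:-1, :] = np.random.randn(H, W, C_IN)  # the 13x13x384 input volume
w = np.random.randn(C_OUT, K, K, C_IN)          # 256 sets of 3x3x384 weights
y = np.zeros((H, W, C_OUT))                     # 13x13x256 output volume

for co in range(C_OUT):        # switch to the next weight set (Page 12)
    for i in range(H):         # move down to the next row (Page 10)
        for j in range(W):     # continue along the row (Page 9)
            # one output pixel: 3x3x384 sub-volume dot 3x3x384 weights (Page 7)
            y[i, j, co] = np.sum(x[i:i+K, j:j+K, :] * w[co])
```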
  • Fully Connected Layers

    Page 16

  • Fully Connected Layers

    fc6 → fc7, fc7 → fc8

    Each output activation is a weighted sum of every input activation, passed through an activation function f:

    a_{1,0} = f( ∑_i w_{0,i,0} · a_{0,i} )

    a_{2,999} = f( ∑_i w_{1,i,999} · a_{1,i} )

    [Figure: layers of activations a_{0,0} … a_{0,4095}, a_{1,0} … a_{1,4095}, a_{2,0} … a_{2,999}, with a weight w_{l,i,j} on every edge]

    Page 17

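    A minimal numpy sketch of one fully-connected step (the fc7 → fc8 shapes from the figure; ReLU is assumed for f, and the weights are stored in numpy's row-major [output, input] order):

```python
import numpy as np

f = lambda z: np.maximum(z, 0.0)   # activation; ReLU assumed

a1 = np.random.randn(4096)         # fc7 activations a_{1,0} ... a_{1,4095}
W1 = np.random.randn(1000, 4096)   # W1[j, i] plays the role of w_{1,i,j}
a2 = f(W1 @ a1)                    # a_{2,j} = f( sum_i w_{1,i,j} * a_{1,i} )
```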
  • CNN Properties

    Compute: dominated by convolution (CONV) layers

    Memory BW: dominated by fully-connected (FC) layers

    [Charts: "Compute GOPs Per Layer" and "Memory Access G Reads Per Layer" for CaffeNet, ZF, VGG11, VGG16, and VGG19, broken down by layer (CONV1-CONV5, FC6-FC8); the CONV layers carry nearly all the GOPs, while the FC layers account for nearly all the memory reads]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 18

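    The chart's shape follows from simple operation counts. A sketch using the standard MAC- and weight-count formulas (the layer shapes are assumed VGG16-style examples, not values read off the chart):

```python
# One MAC per weight per output pixel for CONV; one MAC per weight for FC.
def conv_cost(h, w, c_in, c_out, k):
    macs = h * w * c_out * k * k * c_in   # compute grows with the h*w plane
    weights = c_out * k * k * c_in        # each weight reused h*w times
    return macs, weights

def fc_cost(n_in, n_out):
    macs = n_in * n_out                   # one op per weight
    weights = n_in * n_out                # every weight read once, no reuse
    return macs, weights

# A mid-network VGG16 CONV layer vs. FC6 (shapes assumed):
print(conv_cost(56, 56, 128, 256, 3))     # ~0.92 G MACs, ~0.29 M weights
print(fc_cost(25088, 4096))               # ~0.10 G MACs, ~103 M weights
```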
  • Humans vs Machines

    Humans are six orders of magnitude more efficient*

    * IBM Watson, ca 2012

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 19

  • Cost of Computation

    Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

    Page 20

  • Cost of Computation

    Stay in on-chip memory (1/100 x power)

    Use smaller multipliers (8 bits vs 32 bits: 1/16 x power)

    Fixed-point vs float (don't waste bits on dynamic range)

    Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

    Page 21

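    A back-of-envelope sketch of how the two ratios compound per multiply-accumulate (the picojoule figures are assumed for illustration, not taken from the talk):

```python
E_MUL32 = 3.1            # pJ, 32-bit multiply (assumed)
E_MUL8  = E_MUL32 / 16   # "8 bits vs 32 bits: 1/16 x power"
E_DRAM  = 640.0          # pJ, off-chip DRAM access (assumed)
E_SRAM  = E_DRAM / 100   # "stay in on-chip memory: 1/100 x power"

print(E_MUL32 + E_DRAM)  # off-chip FP32 operand: ~643 pJ per MAC
print(E_MUL8 + E_SRAM)   # on-chip INT8 operand:  ~6.6 pJ, ~100x less
```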
  • Improving Machine Efficiency

    Model Pruning

    Right-Sizing Precision

    Custom CNN Processor Architecture

    Page 22

  • Pruning: Remove Elements, Retrain to Recover Accuracy

    Train Connectivity → Prune Connections → Train Weights (iterate)

    Remove Low-Contribution Weights (Synapses), Retrain Remaining Weights

    [Chart: Accuracy Loss (+0.5% to -4.5%) vs. Parameters Pruned Away (40%-100%), for L1/L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]

    Source: Han, et al, "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015, http://arxiv.org/pdf/1506.02626v3.pdf

    Page 23

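    A minimal sketch of the prune-and-retrain loop described above, using magnitude pruning with a fixed mask (numpy only; the gradient is a stand-in for a real training step):

```python
import numpy as np

def prune(w, fraction):
    """Zero the lowest-magnitude `fraction` of weights; return new w and mask."""
    threshold = np.quantile(np.abs(w), fraction)
    mask = np.abs(w) > threshold
    return w * mask, mask

def retrain_step(w, mask, grad, lr=0.01):
    """Update only surviving weights; pruned connections stay at zero."""
    return (w - lr * grad) * mask

w = np.random.randn(1024, 1024)            # a stand-in weight matrix
for frac in (0.5, 0.7, 0.9):               # iterative prune and retrain
    w, mask = prune(w, frac)
    for _ in range(10):                    # retrain to recover accuracy
        grad = np.random.randn(*w.shape)   # stand-in for a real gradient
        w = retrain_step(w, mask, grad)
```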
  • Pruning Results: AlexNet

    9x Reduction In #Weights

    Most Reduction In FC Layers

    Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

    Page 24

  • Pruning Results: AlexNet

    < 0.1% Accuracy Loss

    Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

    Page 25

  • Inference with Integer Quantization

    Page 26

  • Right-Sizing Precision

    Dynamic: Variable-Format Fixed-Point (Per Layer), < 1% Accuracy Loss

    Network: VGG16

    Data Bits        | Single-float | 16    | 16    | 8          | 8
    Weight Bits      | Single-float | 16    | 8     | 8          | 8 or 4
    Data Precision   | N/A          | 2^-2  | 2^-2  | 2^-5/2^-1  | Dynamic
    Weight Precision | N/A          | 2^-15 | 2^-7  | 2^-7       | Dynamic
    Top-1 Accuracy   | 68.1%        | 68.0% | 53.0% | 28.2%      | 67.0%
    Top-5 Accuracy   | 88.0%        | 87.9% | 76.6% | 49.7%      | 87.6%

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 27

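    A sketch of the per-layer "dynamic" fixed-point idea from the table: 8 bits total, with each layer's fractional bit count n picked from its observed range (function names are illustrative):

```python
import numpy as np

def choose_frac_bits(x, total_bits=8):
    """Give the integer part (plus sign) just enough bits for max |x|;
    the remaining bits become the fractional part n of Qm.n."""
    int_bits = int(np.ceil(np.log2(np.abs(x).max() + 1e-12))) + 1
    return total_bits - max(int_bits, 1)

def quantize(x, n, total_bits=8):
    """Round to steps of 2^-n and clip codes to the signed 8-bit range."""
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * 2.0**n), lo, hi) / 2.0**n

acts = 4.0 * np.random.randn(13, 13, 256)   # one layer's activations
n = choose_frac_bits(acts)                  # per-layer format Q(8-n).n
q = quantize(acts, n)                       # max rounding error ~ 2^-(n+1)
```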
  • Right-Sizing Precision

    Fixed-Point Sufficient For Deployment (INT16, INT8)

    No Significant Loss in Accuracy (< 1%)

    >10x Energy Efficiency OPs/J (INT8 vs FP32)

    4x Memory Energy Efficiency Tx/J (INT8 vs FP32)

    Page 28

  • Improving Machine Efficiency

    CNN Model → (model pruning) → Pruned Floating-Point Model → (data/weight quantization) → Pruned Fixed-Point Model → (compilation) → Instructions → (run on the FPGA-Based Neural Network Processor)

    Modified From: Yu Wang, Tsinghua University, Feb 2016

    Page 29

  • OpenPOWER CAPI: AlphaData ADM-PCIE-8K5

    Xilinx Kintex® UltraScale™ KU115 (20nm)

    5520 DSP cores, up to 500 MHz

    5.5 T OPs int16 (peak)

    4 GB DDR4-2400 & 38 GB/s

    55 W TDP & 100 G OPs/W

    Single Slot, Low-Profile Form Factor

    Page 30

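    The headline figures are consistent with the DSP count alone, counting one multiply-accumulate as two operations:

```python
dsps, mhz, watts = 5520, 500, 55
peak_ops = dsps * 2 * mhz * 1_000_000   # one MAC = 2 ops per DSP per cycle
print(peak_ops / 1e12)                  # 5.52  -> "5.5 T OPs int16 (peak)"
print(peak_ops / watts / 1e9)           # ~100  -> "100 G OPs/W" at 55 W TDP
```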
  • FPGA Architecture

    [Figure: 2D fabric of CLB, DSP, and RAM blocks]

    2D Array Architecture (scales with Moore's Law)

    Memory-Proximate Computing (Minimize Data Moves)

    Broadcast-Capable Interconnect (Data Sharing/Reuse)

    Page 31

  • FPGA Arithmetic & Memory Resources

    Native 16-bit multiplier (or reduced-power 8-bit) with 48-bit accumulator: O_i = ∑_j W_ij · D_j

    On-Chip RAMs store custom widths: INT4, INT8, INT16, INT32, FP16, FP32

    Custom Quantization Formatting (Qm.n, e.g. Q8.8, Q2.14)

    Page 32

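    A worked Qm.n example (plain integer arithmetic; values chosen for illustration): the product of a Q2.14 weight and a Q8.8 datum is exact in Q10.22, leaving ample headroom in a 48-bit accumulator:

```python
def to_fixed(x, n):  return round(x * (1 << n))   # real -> Qm.n integer code
def to_real(c, n):   return c / (1 << n)          # Qm.n integer code -> real

w = to_fixed(0.7431, 14)     # weight in Q2.14 (step 2^-14)
d = to_fixed(-3.25, 8)       # datum  in Q8.8  (step 2^-8)
acc = w * d                  # exact product in Q10.22 (fraction bits add)
print(to_real(acc, 22))      # ~ -2.41508, i.e. 0.7431 * -3.25, quantized
```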
  • Convolver Unit

    [Figure: input data and weights stream from a Data buffer and a Weight buffer, through MUXes, into a 3x3 array of multipliers (9 data inputs, 9 weight inputs) feeding an adder tree; two row-length delay lines ("n delays", "m delays") hold the previous input rows]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 33

  • Convolver Unit

    [Figure: same convolver as Page 33]

    Memory-Proximate Compute: 2D Parallel Memory, 2D Operator Array

    INT16 datapath

    Serial-to-Parallel, Ping/Pong buffering

    Serial-to-Parallel, Data Reuse: 8/9

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 34

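    A behavioral sketch of such a line-buffered 3x3 convolver: one pixel enters per "cycle", two row-length delay lines supply the other window values (the 8-of-9 data reuse), and nine products feed an adder tree (boundary handling omitted):

```python
from collections import deque

def convolver_3x3(stream, weights, width):
    """Streamed 3x3 convolution; `weights` is a 3x3 nested list."""
    row1 = deque([0.0] * width, maxlen=width)   # "n delays": previous row
    row2 = deque([0.0] * width, maxlen=width)   # "m delays": row before that
    win = [[0.0] * 3 for _ in range(3)]         # 3x3 register window
    for px in stream:                           # one new pixel per "cycle"
        col = (row2[0], row1[0], px)            # only 1 of 9 values is new
        row2.append(row1[0])                    # pixel leaving row1 enters row2
        row1.append(px)
        for r in range(3):                      # shift window; reuse 8 of 9
            win[r] = win[r][1:] + [col[r]]
        # 9 multipliers feeding an adder tree:
        yield sum(win[r][c] * weights[r][c]
                  for r in range(3) for c in range(3))
```

    Valid outputs appear once the window has crossed the first two rows, just as a hardware convolver has a pipeline fill delay.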
  • Processing Engine (PE)

    [Figure: an Input Buffer supplies Data, Bias, and Weights to a Convolver Complex of parallel convolvers (C); an Adder Tree with Bias Shift and Data Shift combines their results, followed by a non-linearity (NL) and pooling (Pool) stage; an Output Buffer holds Intermediate Data, all under a Controller]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 35

  • Processing Engine (PE)

    [Figure: same PE as Page 35]

    Memory Sharing, Broadcast Weights

    Custom Quantization

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 36

  • Top Level

    [Figure: the Processing System (POWER CPU and External Memory) connects over a Data & Inst. Bus to the Programmable Logic, which holds a DMA with compression, FIFO, Controller, Input and Output Buffers, a Computing Complex of multiple PEs, and a Config. Bus]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 37

  • Top Level

    [Figure: same top level as Page 37]

    SW-Scheduled Dataflow

    Decompress weights on the fly

    Multiple PEs: Block-Level Parallelism

    Ping-Pong Buffers: Transfers Overlap with Compute

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 38

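    The ping-pong discipline as a sketch (sequential Python standing in for what the DMA engine and PEs do concurrently; dma_load and compute are caller-supplied stand-ins):

```python
def run_tiles(tiles, dma_load, compute):
    """Ping-pong: buffer i%2 is computed on while the other is refilled."""
    buf = [dma_load(tiles[0]), None]      # prime the "ping" buffer
    for i in range(len(tiles)):
        if i + 1 < len(tiles):            # in hardware this DMA overlaps
            buf[(i + 1) % 2] = dma_load(tiles[i + 1])
        yield compute(buf[i % 2])         # PE consumes the ready buffer

# e.g. list(run_tiles([0, 1, 2], dma_load=lambda t: t, compute=lambda b: 2*b))
```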
  • FPGA Neural Net Processor

    Tiled Architecture (Parallelism & Scaling)

    Semi-Static Dataflow (Pre-Scheduled Data Transfers)

    Memory Reuse (Data Sharing across Convolvers)

    Page 39

  • OpenPOWER CAPI

    Shared Virtual Memory

    System-Wide Memory Coherency

    Low-Latency Control Messages

    Peer Programming Model and Interaction Efficiency

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    Page 40

  • OpenPOWER CAPI

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    POWER host:
    • Caffe, TensorFlow, etc.
    • Load CNN Model
    • Call AuvizDNN Library

    Xilinx FPGA:
    • AuvizDNN Kernel
    • Scalable & Fully Parameterized
    • Plug-and-Play Library

    Page 41

  • OpenPOWER CAPI

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    14 Images/s/W (AlexNet)

    Batch Size 1

    Low-Profile TDP

    Page 42

  • Takeaways

    FPGA: Ideal Dataflow CNN Processor

    POWER/CAPI: Elevates Accelerators to Peers of the CPU

    FPGA CNN Libraries

    Page 43

  • Thank You!

    4/11/2016

    Page 44