Revolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems

Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx
Join the Conversation #OpenPOWERSummit


  • Revolutionizing the Datacenter

    Power-Efficient Machine Learning using FPGAs on POWER Systems

    Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx

    Join the Conversation #OpenPOWERSummit

  • Super Human

    Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

    [Chart: top-5 accuracy by year**; humans: ~95%***]

    * http://image-net.org/challenges/LSVRC/
    ** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
    *** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

    Page 2

  • Super Human

    Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

    CNNs far outperform non-AI methods

    CNNs deliver super-human accuracy

    [Chart: top-5 accuracy by year**; humans: ~95%***]

    * http://image-net.org/challenges/LSVRC/
    ** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
    *** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

    Page 3

  • CNNs Explained

    Page 4

  • The Computation

    Page 5

  • The Computation

    Page 6

  • Convolution

    Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights

    [Figure: 13x13x384 Input, 3x3x384 Kernel Weights, 13x13x256 Output]

    Page 7

  • Convolution

    Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

    [Figure: as on Page 7]

    Page 8

  • Convolution

    Continue along the row ...

    [Figure: as on Page 7]

    Page 9

  • Convolution

    ... before moving down to the next row

    [Figure: as on Page 7]

    Page 10

  • Convolution

    The first output feature map is complete

    [Figure: as on Page 7]

    Page 11

  • Convolution

    Move on to the next output feature map by switching weights, and repeat

    [Figure: as on Page 7]

    Page 12

  • Convolution

    Pattern repeats as before: same input volumes, different weights

    [Figure: as on Page 7]

    Page 13

  • Convolution

    Complete the second output feature map plane

    [Figure: as on Page 7]

    Page 14

  • Convolution

    Finally, after 256 weight sets have been used, the full 13x13x256 output volume is complete

    [Figure: as on Page 7]

    Page 15

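    The loop nest these slides animate, as a minimal numpy sketch (shapes taken from the figure; a one-pixel zero pad is assumed so the output stays 13x13):

```python
import numpy as np

H, W, C_IN, C_OUT, K = 13, 13, 384, 256, 3

x = np.zeros((H + 2, W + 2, C_IN))              # input with 1-pixel zero pad
x[1:-1, 1:-1, :] = np.random.randn(H, W, C_IN)  # the 13x13x384 input volume
w = np.random.randn(C_OUT, K, K, C_IN)          # 256 sets of 3x3x384 weights
y = np.zeros((H, W, C_OUT))                     # 13x13x256 output volume

for co in range(C_OUT):        # switch to the next weight set (Page 12)
    for i in range(H):         # move down to the next row (Page 10)
        for j in range(W):     # continue along the row (Page 9)
            # one output pixel: 3x3x384 sub-volume dot 3x3x384 weights (Page 7)
            y[i, j, co] = np.sum(x[i:i+K, j:j+K, :] * w[co])
```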
  • Fully Connected Layers

    Page 16

  • Fully Connected Layers

    fc6 → fc7, fc7 → fc8

    Each output activation is a weighted sum of every input activation, passed through an activation function f:

    a_{1,0} = f( ∑_i w_{0,i,0} · a_{0,i} )

    a_{2,999} = f( ∑_i w_{1,i,999} · a_{1,i} )

    [Figure: layers of activations a_{0,0} … a_{0,4095}, a_{1,0} … a_{1,4095}, a_{2,0} … a_{2,999}, with a weight w_{l,i,j} on every edge]

    Page 17

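    A minimal numpy sketch of one fully-connected step (the fc7 → fc8 shapes from the figure; ReLU is assumed for f, and the weights are stored in numpy's row-major [output, input] order):

```python
import numpy as np

f = lambda z: np.maximum(z, 0.0)   # activation; ReLU assumed

a1 = np.random.randn(4096)         # fc7 activations a_{1,0} ... a_{1,4095}
W1 = np.random.randn(1000, 4096)   # W1[j, i] plays the role of w_{1,i,j}
a2 = f(W1 @ a1)                    # a_{2,j} = f( sum_i w_{1,i,j} * a_{1,i} )
```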
  • CNN Properties

    Compute: dominated by convolution (CONV) layers

    Memory BW: dominated by fully-connected (FC) layers

    [Charts: "Compute GOPs Per Layer" and "Memory Access G Reads Per Layer" for CaffeNet, ZF, VGG11, VGG16, and VGG19, broken down by layer (CONV1-CONV5, FC6-FC8); the CONV layers carry nearly all the GOPs, while the FC layers account for nearly all the memory reads]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 18

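    The chart's shape follows from simple operation counts. A sketch using the standard MAC- and weight-count formulas (the layer shapes are assumed VGG16-style examples, not values read off the chart):

```python
# One MAC per weight per output pixel for CONV; one MAC per weight for FC.
def conv_cost(h, w, c_in, c_out, k):
    macs = h * w * c_out * k * k * c_in   # compute grows with the h*w plane
    weights = c_out * k * k * c_in        # each weight reused h*w times
    return macs, weights

def fc_cost(n_in, n_out):
    macs = n_in * n_out                   # one op per weight
    weights = n_in * n_out                # every weight read once, no reuse
    return macs, weights

# A mid-network VGG16 CONV layer vs. FC6 (shapes assumed):
print(conv_cost(56, 56, 128, 256, 3))     # ~0.92 G MACs, ~0.29 M weights
print(fc_cost(25088, 4096))               # ~0.10 G MACs, ~103 M weights
```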
  • Humans vs Machines

    Humans are six orders of magnitude more efficient*

    * IBM Watson, ca 2012

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 19

  • Cost of Computation

    Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

    Page 20

  • Cost of Computation

    Stay in on-chip memory (1/100 x power)

    Use smaller multipliers (8 bits vs 32 bits: 1/16 x power)

    Fixed-point vs float (don't waste bits on dynamic range)

    Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

    Page 21

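    A back-of-envelope sketch of how the two ratios compound per multiply-accumulate (the picojoule figures are assumed for illustration, not taken from the talk):

```python
E_MUL32 = 3.1            # pJ, 32-bit multiply (assumed)
E_MUL8  = E_MUL32 / 16   # "8 bits vs 32 bits: 1/16 x power"
E_DRAM  = 640.0          # pJ, off-chip DRAM access (assumed)
E_SRAM  = E_DRAM / 100   # "stay in on-chip memory: 1/100 x power"

print(E_MUL32 + E_DRAM)  # off-chip FP32 operand: ~643 pJ per MAC
print(E_MUL8 + E_SRAM)   # on-chip INT8 operand:  ~6.6 pJ, ~100x less
```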
  • Improving Machine Efficiency

    Model Pruning

    Right-Sizing Precision

    Custom CNN Processor Architecture

    Page 22

  • Pruning: Remove Elements, Retrain to Recover Accuracy

    Train Connectivity → Prune Connections → Train Weights (iterate)

    Remove Low-Contribution Weights (Synapses), Retrain Remaining Weights

    [Chart: Accuracy Loss (+0.5% to -4.5%) vs. Parameters Pruned Away (40%-100%), for L1/L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]

    Source: Han, et al, "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015, http://arxiv.org/pdf/1506.02626v3.pdf

    Page 23

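    A minimal sketch of the prune-and-retrain loop described above, using magnitude pruning with a fixed mask (numpy only; the gradient is a stand-in for a real training step):

```python
import numpy as np

def prune(w, fraction):
    """Zero the lowest-magnitude `fraction` of weights; return new w and mask."""
    threshold = np.quantile(np.abs(w), fraction)
    mask = np.abs(w) > threshold
    return w * mask, mask

def retrain_step(w, mask, grad, lr=0.01):
    """Update only surviving weights; pruned connections stay at zero."""
    return (w - lr * grad) * mask

w = np.random.randn(1024, 1024)            # a stand-in weight matrix
for frac in (0.5, 0.7, 0.9):               # iterative prune and retrain
    w, mask = prune(w, frac)
    for _ in range(10):                    # retrain to recover accuracy
        grad = np.random.randn(*w.shape)   # stand-in for a real gradient
        w = retrain_step(w, mask, grad)
```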
  • Pruning Results: AlexNet

    9x Reduction In #Weights

    Most Reduction In FC Layers

    Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

    Page 24

  • Pruning Results: AlexNet

    < 0.1% Accuracy Loss

    Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

    Page 25

  • Inference with Integer Quantization

    Page 26

  • Right-Sizing Precision

    Dynamic: Variable-Format Fixed-Point (Per Layer), < 1% Accuracy Loss

    Network: VGG16

    Data Bits        | Single-float | 16    | 16    | 8          | 8
    Weight Bits      | Single-float | 16    | 8     | 8          | 8 or 4
    Data Precision   | N/A          | 2^-2  | 2^-2  | 2^-5/2^-1  | Dynamic
    Weight Precision | N/A          | 2^-15 | 2^-7  | 2^-7       | Dynamic
    Top-1 Accuracy   | 68.1%        | 68.0% | 53.0% | 28.2%      | 67.0%
    Top-5 Accuracy   | 88.0%        | 87.9% | 76.6% | 49.7%      | 87.6%

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 27

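    A sketch of the per-layer "dynamic" fixed-point idea from the table: 8 bits total, with each layer's fractional bit count n picked from its observed range (function names are illustrative):

```python
import numpy as np

def choose_frac_bits(x, total_bits=8):
    """Give the integer part (plus sign) just enough bits for max |x|;
    the remaining bits become the fractional part n of Qm.n."""
    int_bits = int(np.ceil(np.log2(np.abs(x).max() + 1e-12))) + 1
    return total_bits - max(int_bits, 1)

def quantize(x, n, total_bits=8):
    """Round to steps of 2^-n and clip codes to the signed 8-bit range."""
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * 2.0**n), lo, hi) / 2.0**n

acts = 4.0 * np.random.randn(13, 13, 256)   # one layer's activations
n = choose_frac_bits(acts)                  # per-layer format Q(8-n).n
q = quantize(acts, n)                       # max rounding error ~ 2^-(n+1)
```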
  • Right-Sizing Precision

    Fixed-Point Sufficient For Deployment (INT16, INT8)

    No Significant Loss in Accuracy (< 1%)

    >10x Energy Efficiency OPs/J (INT8 vs FP32)

    4x Memory Energy Efficiency Tx/J (INT8 vs FP32)

    Page 28

  • Improving Machine Efficiency

    CNN Model → (model pruning) → Pruned Floating-Point Model → (data/weight quantization) → Pruned Fixed-Point Model → (compilation) → Instructions → (run on the FPGA-Based Neural Network Processor)

    Modified From: Yu Wang, Tsinghua University, Feb 2016

    Page 29

  • OpenPOWER CAPI: AlphaData ADM-PCIE-8K5

    Xilinx Kintex® UltraScale™ KU115 (20nm)

    5520 DSP cores, up to 500 MHz

    5.5 T OPs int16 (peak)

    4 GB DDR4-2400 & 38 GB/s

    55 W TDP & 100 G OPs/W

    Single Slot, Low-Profile Form Factor

    Page 30

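    The headline figures are consistent with the DSP count alone, counting one multiply-accumulate as two operations:

```python
dsps, mhz, watts = 5520, 500, 55
peak_ops = dsps * 2 * mhz * 1_000_000   # one MAC = 2 ops per DSP per cycle
print(peak_ops / 1e12)                  # 5.52  -> "5.5 T OPs int16 (peak)"
print(peak_ops / watts / 1e9)           # ~100  -> "100 G OPs/W" at 55 W TDP
```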
  • FPGA Architecture

    [Figure: 2D fabric of CLB, DSP, and RAM blocks]

    2D Array Architecture (scales with Moore's Law)

    Memory-Proximate Computing (Minimize Data Moves)

    Broadcast-Capable Interconnect (Data Sharing/Reuse)

    Page 31

  • FPGA Arithmetic & Memory Resources

    Native 16-bit multiplier (or reduced-power 8-bit) with 48-bit accumulator: O_i = ∑_j W_ij · D_j

    On-Chip RAMs store custom widths: INT4, INT8, INT16, INT32, FP16, FP32

    Custom Quantization Formatting (Qm.n, e.g. Q8.8, Q2.14)

    Page 32

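    A worked Qm.n example (plain integer arithmetic; values chosen for illustration): the product of a Q2.14 weight and a Q8.8 datum is exact in Q10.22, leaving ample headroom in a 48-bit accumulator:

```python
def to_fixed(x, n):  return round(x * (1 << n))   # real -> Qm.n integer code
def to_real(c, n):   return c / (1 << n)          # Qm.n integer code -> real

w = to_fixed(0.7431, 14)     # weight in Q2.14 (step 2^-14)
d = to_fixed(-3.25, 8)       # datum  in Q8.8  (step 2^-8)
acc = w * d                  # exact product in Q10.22 (fraction bits add)
print(to_real(acc, 22))      # ~ -2.41508, i.e. 0.7431 * -3.25, quantized
```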
  • Convolver Unit

    [Figure: input data and weights stream from a Data buffer and a Weight buffer, through MUXes, into a 3x3 array of multipliers (9 data inputs, 9 weight inputs) feeding an adder tree; two row-length delay lines ("n delays", "m delays") hold the previous input rows]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 33

  • Convolver Unit

    [Figure: same convolver as Page 33]

    Memory-Proximate Compute: 2D Parallel Memory, 2D Operator Array

    INT16 datapath

    Serial-to-Parallel, Ping/Pong buffering

    Serial-to-Parallel, Data Reuse: 8/9

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 34

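    A behavioral sketch of such a line-buffered 3x3 convolver: one pixel enters per "cycle", two row-length delay lines supply the other window values (the 8-of-9 data reuse), and nine products feed an adder tree (boundary handling omitted):

```python
from collections import deque

def convolver_3x3(stream, weights, width):
    """Streamed 3x3 convolution; `weights` is a 3x3 nested list."""
    row1 = deque([0.0] * width, maxlen=width)   # "n delays": previous row
    row2 = deque([0.0] * width, maxlen=width)   # "m delays": row before that
    win = [[0.0] * 3 for _ in range(3)]         # 3x3 register window
    for px in stream:                           # one new pixel per "cycle"
        col = (row2[0], row1[0], px)            # only 1 of 9 values is new
        row2.append(row1[0])                    # pixel leaving row1 enters row2
        row1.append(px)
        for r in range(3):                      # shift window; reuse 8 of 9
            win[r] = win[r][1:] + [col[r]]
        # 9 multipliers feeding an adder tree:
        yield sum(win[r][c] * weights[r][c]
                  for r in range(3) for c in range(3))
```

    Valid outputs appear once the window has crossed the first two rows, just as a hardware convolver has a pipeline fill delay.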
  • Processing Engine (PE)

    [Figure: an Input Buffer supplies Data, Bias, and Weights to a Convolver Complex of parallel convolvers (C); an Adder Tree with Bias Shift and Data Shift combines their results, followed by a non-linearity (NL) and pooling (Pool) stage; an Output Buffer holds Intermediate Data, all under a Controller]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 35

  • Processing Engine (PE)

    [Figure: same PE as Page 35]

    Memory Sharing, Broadcast Weights

    Custom Quantization

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 36

  • Top Level

    [Figure: the Processing System (POWER CPU and External Memory) connects over a Data & Inst. Bus to the Programmable Logic, which holds a DMA with compression, FIFO, Controller, Input and Output Buffers, a Computing Complex of multiple PEs, and a Config. Bus]

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 37

  • Top Level

    [Figure: same top level as Page 37]

    SW-Scheduled Dataflow

    Decompress weights on the fly

    Multiple PEs: Block-Level Parallelism

    Ping-Pong Buffers: Transfers Overlap with Compute

    Source: Yu Wang, Tsinghua University, Feb 2016

    Page 38

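    The ping-pong discipline as a sketch (sequential Python standing in for what the DMA engine and PEs do concurrently; dma_load and compute are caller-supplied stand-ins):

```python
def run_tiles(tiles, dma_load, compute):
    """Ping-pong: buffer i%2 is computed on while the other is refilled."""
    buf = [dma_load(tiles[0]), None]      # prime the "ping" buffer
    for i in range(len(tiles)):
        if i + 1 < len(tiles):            # in hardware this DMA overlaps
            buf[(i + 1) % 2] = dma_load(tiles[i + 1])
        yield compute(buf[i % 2])         # PE consumes the ready buffer

# e.g. list(run_tiles([0, 1, 2], dma_load=lambda t: t, compute=lambda b: 2*b))
```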
  • FPGA Neural Net Processor

    Tiled Architecture (Parallelism & Scaling)

    Semi-Static Dataflow (Pre-Scheduled Data Transfers)

    Memory Reuse (Data Sharing across Convolvers)

    Page 39

  • OpenPOWER CAPI

    Shared Virtual Memory

    System-Wide Memory Coherency

    Low-Latency Control Messages

    Peer Programming Model and Interaction Efficiency

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    Page 40

  • OpenPOWER CAPI

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    POWER host:
    • Caffe, TensorFlow, etc.
    • Load CNN Model
    • Call AuvizDNN Library

    Xilinx FPGA:
    • AuvizDNN Kernel
    • Scalable & Fully Parameterized
    • Plug-and-Play Library

    Page 41

  • OpenPOWER CAPI

    [Figure: POWER8 CAP unit connected to the CAPI PSL on the FPGA]

    14 Images/s/W (AlexNet)

    Batch Size 1

    Low-Profile TDP

    Page 42

  • Takeaways

    FPGA: Ideal Dataflow CNN Processor

    POWER/CAPI: Elevates Accelerators to Peers of the CPU

    FPGA CNN Libraries

    Page 43

  • Thank You!

    4/11/2016

    Page 44