
Robust and Energy-Efficient Deep Learning Systems

Muhammad Abdullah Hanif¹ (Ph.D. Candidate), Muhammad Shafique² (Advisor)
¹ Technische Universität Wien (TU Wien), Vienna, Austria
² Division of Engineering, New York University Abu Dhabi (NYUAD), Abu Dhabi, UAE

PHD FORUM

Selected Publications

[DAC’17] M. A. Hanif, R. Hafiz, O. Hasan, M. Shafique, “QuAd: Design and analysis of quality-area optimal low-latency approximate adders”, Design Automation Conference (DAC), pp. 1–6, 2017. Received a HiPEAC Paper Award.

[DATE’18] M. A. Hanif, R. Hafiz, M. Shafique, “Error resilience analysis for systematically employing approximate computing in convolutional neural networks”, DATE, pp. 913–916, 2018.

[DAC’19] M. A. Hanif, F. Khalid, M. Shafique, “CANN: Curable approximations for high-performance deep neural network accelerators”, DAC, pp. 1–6, 2019. Received a HiPEAC Paper Award.

[DAC’20] M. A. Hanif, R. Hafiz, O. Hasan, M. Shafique, “PEMACx: A probabilistic error analysis methodology for adders with cascaded approximate units”, DAC, pp. 1–6, 2020. Received a HiPEAC Paper Award.

[DATE’21] M. A. Hanif, M. Shafique, “DNN-Life: An energy-efficient aging mitigation framework for improving the lifetime of on-chip weight memories in deep neural network hardware architectures”, DATE, pp. 1–6, 2021.

[JOLPE’18] M. A. Hanif, A. Marchisio, T. Arif, R. Hafiz, S. Rehman, M. Shafique, “X-DNNs: Systematic cross-layer approximations for energy-efficient deep neural networks”, Journal of Low Power Electronics (JOLPE), vol. 14, no. 4, pp. 520–534, 2018.

[IOLTS’18] M. A. Hanif, F. Khalid, R. V. W. Putra, S. Rehman, M. Shafique, “Robust machine learning systems: Reliability and security for deep neural networks”, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS), pp. 257–260, 2018.

[IOLTS’20] M. A. Hanif, M. Shafique, “Dependable deep learning: Towards cost-efficient resilience of deep neural network accelerators against soft errors and permanent faults”, IOLTS, pp. 1–4, 2020.

[RSTA’20] M. A. Hanif, M. Shafique, “SalvageDNN: Salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping”, Philosophical Transactions of the Royal Society A (RSTA), vol. 378, no. 2164, p. 20190164, 2020.

dblp: https://dblp.org/pid/191/2451.html

Overview of Our Methodology · Problems and Motivation

❑ CANN: Curable Approximations [DAC’19]
❑ SalvageDNN: Fault-Aware Mapping [RSTA’20]
❑ HW/SW Approximations [JOLPE’18]

Inputs to the methodology: datasets, the DNN architecture / pre-trained DNNs, the DNN hardware architecture, and the hardware-induced reliability threats (soft errors, permanent faults, aging).

Techniques for robust and energy-efficient DNN inference:
❑ Error Resilience Analysis of DNNs [DATE’18, IOLTS’18]
❑ Designer-Induced Faults for Energy-Efficient DNN Inference [JOLPE’18]: pruning + quantization, functional approximations
❑ Analytical Methods for Error Estimation of Approximate Modules [DAC’17, DAC’20]
❑ Design Space Exploration of Approximate Modules [DAC’17, DAC’20]
❑ Curable Approximations [DAC’19]
❑ Software-level Techniques for Mitigating Reliability Threats [IOLTS’20]: fault-aware mapping [RSTA’20], fault-aware training, fault-aware pruning, range restriction-based soft error mitigation [IOLTS’20]
❑ Hardware Modifications for Mitigating Reliability Threats [IOLTS’20]: permanent-fault mitigation support [RSTA’20], aging mitigation [DATE’21], soft error mitigation support

Outputs: optimized DNNs and optimized DNN hardware, enabling robust and energy-efficient DNN inference.
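The range-restriction idea listed above can be illustrated with a minimal sketch. It is a hypothetical example, not the exact scheme of [IOLTS’20]: the clip bounds, layer, and variable names are assumptions. Activations are clamped to a range profiled on fault-free runs, so a soft-error-induced bit flip cannot inject an arbitrarily large value into later layers.

```python
import numpy as np

def profile_bounds(activations):
    """Profile activation bounds on fault-free inputs (assumed calibration step)."""
    return float(activations.min()), float(activations.max())

def range_restrict(activations, lo, hi):
    """Clamp activations to the profiled range so a flipped high-order bit
    cannot propagate an out-of-range value to the following layers."""
    return np.clip(activations, lo, hi)

# Toy example: a bit flip turns one activation into a huge value; clipping masks it.
clean = np.array([0.1, 2.3, 0.0, 1.7], dtype=np.float32)
lo, hi = profile_bounds(clean)
faulty = clean.copy()
faulty[1] = 6.0e8                       # corrupted by a soft error in an exponent bit
print(range_restrict(faulty, lo, hi))   # corrupted value is clamped back to the profiled maximum
```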

❑ CANN: Curable Approximations [DAC’19]

Accurate system: Input → Stage 1 (Accu. Module) → O1 → Stage 2 (Accu. Module) → O2 → … → Stage N (Accu. Module) → ON

Approximate system: Input → Stage 1 (Approx. Module) → ≈O1 → Stage 2 (Approx. Module) → ≈O2 → … → Stage N (Approx. Module) → ON + ε

Proposed system: Input → Stage 1 (DAx Module) → O1 + ε1, forwarding f(ε1) → Stage 2 (C&DAx Module) → O2 + ε2, forwarding f(ε2) → … → Stage N (C&DAx Module) → ON + εN, forwarding f(εN) → Stage N+1 (Cu Module) → ON
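To make the curing idea concrete, here is a minimal numerical sketch. The stage functions (×3, +7) and the two-bit truncation are purely illustrative assumptions, not the actual DAx/C&DAx arithmetic circuits of [DAC’19]; the point is only that each stage forwards its exact error f(ε) so the next stage can cure it, and the final Cu module removes the last residual error.

```python
# Minimal numerical sketch of the curable-approximation pipeline (illustrative only).

def dax_stage(x):
    """DAx stage: approximate output plus the forwarded error signal f(eps)."""
    exact = x * 3                 # stand-in for the accurate stage-1 function
    approx = exact & ~0x3         # toy approximation: drop the two least-significant bits
    return approx, exact - approx

def cdax_stage(x, err_in):
    """C&DAx stage: cures the incoming error, applies its function, forwards its own error."""
    exact = (x + err_in) + 7      # stand-in for the accurate stage-2 function
    approx = exact & ~0x3
    return approx, exact - approx

def cu_stage(x, err_in):
    """Cu module (stage N+1): removes the last stage's error, recovering the accurate result."""
    return x + err_in

o1, e1 = dax_stage(41)            # Stage 1: O1 + eps1, forwards f(eps1)
o2, e2 = cdax_stage(o1, e1)       # Stage 2: O2 + eps2, forwards f(eps2)
out = cu_stage(o2, e2)            # Stage N+1: accurate ON
assert out == 41 * 3 + 7          # identical to the fully accurate pipeline
```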

[Figure: Conventional vs. proposed systolic array. In the conventional array, all PEs use accurate MACs; weights and activations stream in and outputs stream out. In the proposed array, the PEs use DAx and C&DAx MACs, and a row of curing (Cu) modules at the array outputs removes the residual error. Legend: PE with DAx MAC, PE with C&DAx MAC, PE with accurate MAC, curing module.]

[Figure: Accurate vs. proposed C&DAx MAC designs, shown step by step (Steps 1, 4, 5: addition stages). Legend: un-complemented partial product bits, complemented partial product bits, 1’b1, discarded bits, partial sum bits, error bit from the previous stage, and the error signal f(ε) forwarded to the next stage.]

❑ HW/SW Approximations [JOLPE’18]

[Plots: (1) Structured pruning (VGG11, Cifar10) — test accuracy [%] vs. model size reduction [%], points a–e. (2) Uniform quantization (VGG11, Cifar10) — test accuracy [%] vs. bit width (4–10 bits), curves a–e; accuracy starts decreasing at the same bit-width. (3) Hardware approximation (VGG11, Cifar10, 8-bit) — test accuracy [%] vs. multiplier configuration, for the configurations below.]

Multiplier configuration | Accurate | Config. 1 | Config. 2 | Config. 3 | Config. 4 | Config. 5
PDP [fJ]                 | 88.40    | 83.66     | 77.71     | 70.60     | 70.90     | 67.40
MSE                      | 0        | 0.3       | 9.8       | 266.3     | 3102.3    | 24806
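For reference, the kind of bit-width sweep shown in plot (2) can be reproduced with a simple symmetric uniform quantizer. This is a minimal sketch under that assumption; the exact quantization scheme used in [JOLPE’18] may differ.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight tensor to the given bit width."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8-bit signed values
    scale = np.max(np.abs(w)) / qmax         # one scale factor per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                         # de-quantized weights used for inference

# Sweep the bit width as in plot (2) and watch the quantization error grow.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)
for bits in range(10, 3, -1):
    mse = float(np.mean((w - quantize_uniform(w, bits)) ** 2))
    print(f"{bits}-bit: MSE = {mse:.2e}")
```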

❑ SalvageDNN: Fault-Aware Mapping [RSTA’20]

Example filters (2×2 weights each):
  Filter 1: W11 = −0.63, W12 = 0.25, W21 = −0.22, W22 = 0.17
  Filter 2: W11 = 0.25, W12 = −0.41, W21 = −0.43, W22 = 0.37
  Filter 3: W11 = 0.59, W12 = −0.39, W21 = 0.45, W22 = 0.67
  Filter 4: W11 = −0.71, W12 = 0.55, W21 = 0.34, W22 = −0.66

Conventional mapping of the filters onto a faulty systolic array (columns = Filters 1–4, rows = W11, W12, W21, W22):
  W11: −0.63   0.25   0.59  −0.71
  W12:  0.25  −0.41  −0.39   0.55
  W21: −0.22  −0.43   0.45   0.34
  W22:  0.17   0.37   0.67  −0.66

Fault-aware mapping (filters permuted so that the least-salient weights fall on the faulty MACs):
  W11:  0.25  −0.63  −0.71   0.59
  W12: −0.41   0.25   0.55  −0.39
  W21: −0.43  −0.22   0.34   0.45
  W22:  0.37   0.17  −0.66   0.67

SalvageDNN flow:
1. Post-fabrication testing of the DNN accelerator yields a fault map of the hardware.
2. Given the deep neural network and the accelerator’s dataflow/mapping scheme, the saliency of the DNN’s weights is computed.
3. For each layer, a mapping is selected by minimizing the sum of the saliencies of the weights that must be pruned due to faulty MACs.
4. The parameters are rearranged (together with a hardware modification) to avoid extra run-time overheads, producing the modified network and an optimized DNN accelerator.
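The core of step 3 can be sketched as a tiny search problem. The brute-force permutation search, the use of |w| as the saliency measure, and the single-faulty-column fault map below are all illustrative assumptions; the actual per-layer saliency estimation and mapping algorithm are described in [RSTA’20].

```python
from itertools import permutations
import numpy as np

def fault_aware_mapping(saliency, faulty_cols):
    """saliency: array of shape [weights_per_filter, num_filters]; faulty_cols: set of
    array columns with permanently faulty MACs. Returns the filter-to-column order that
    minimizes the summed saliency of the weights that would have to be pruned."""
    n = saliency.shape[1]
    best_order, best_cost = None, float("inf")
    for order in permutations(range(n)):                 # brute force is fine for a sketch
        cost = sum(saliency[:, order[c]].sum() for c in faulty_cols)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return list(best_order), best_cost

# Toy example using the four 2x2 filters shown above; assume array column 0 is faulty.
w = np.array([[-0.63,  0.25,  0.59, -0.71],
              [ 0.25, -0.41, -0.39,  0.55],
              [-0.22, -0.43,  0.45,  0.34],
              [ 0.17,  0.37,  0.67, -0.66]])
order, pruned = fault_aware_mapping(np.abs(w), faulty_cols={0})
print("filter order:", order, "| summed saliency of pruned weights:", round(float(pruned), 2))
```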

Problems and Motivation

❑ Applications: Natural Language Processing; Image Classification, Object Detection & Localization, etc.

❑ Compute Cost [OPs]
    ❑ AlexNet → 1.5B
    ❑ VGGNet → 19.6B
    ❑ Inception → 2B
    ❑ ResNet-152 → 11B

❑ Design Methods to Improve Robustness and Energy-Efficiency of DNNs
    ❑ Cost-effective techniques for mitigating reliability threats
    ❑ HW/SW approximations for achieving high energy-efficiency

[Figure: hardware-induced reliability threats — physical-level mechanisms and their manifestations, adapted from R. Baumann (TDMR’05), K. Kang et al. (ASP-DAC’08), B. Raghunathan et al. (DATE’13), and S. Dighe et al. (ISSCC’10).]
❑ Soft Errors — particle strikes in transistors; manifest as bit flips in registers (R0–R3).
❑ Process Variation — core-to-core frequency variation, e.g., ≈7.3 GHz for the fastest vs. ≈5.7 GHz for the slowest core of Intel’s 80-core processor; manifests as performance loss.
❑ Aging — trap generation in the transistor gate oxide under negative gate bias (Vg = −Vdd); manifests as wearout, i.e., a rising failure rate over time.

[Figure: Hardware for DNN inference — off-chip memory, on-chip buffer, control unit, a processing array of PEs, and accumulators.]

[Figure: Effect of hardware faults on DNN inference — for a stop-sign input image, the expected output is “Stop Sign”, whereas the output with faults is “40 km/h”.]

❑ For VGG11 with the ImageNet dataset, our SalvageDNN method takes ≈2 minutes on a CPU, while one epoch of re-training takes ≈100 minutes on a GPU.

❑ Using the proposed MAC designs, the CANN method offers 1.5× PDP reduction with no accuracy loss.

[Image source: Google Images]
