


Neural Network Engine (NNE)

• NN inference is dominated by memory accesses, not MAC operations

• 8-bit wordlength has little impact on model accuracy

Using the three optimization techniques in NNE reduced:

o Memory accesses by 5.6x compared to baseline xDSP

o Energy costs by 5.5x compared to baseline xDSP

o Memory required for parameters by 3x

• NNE always performs in a deterministic number of cycles for any arbitrary network

Conclusion

Results

Deterministic number of cycles:

3. Two-step scaling

• Solves inefficient individual scaling of activations

3.1 Within a vector (when writing results)

3.2 Across a layer (when reading results in the next layer)


*Minimum number of required accesses; this number can grow depending on how many times previously computed values must be retrieved and shifted additionally.

**This number does not include the energy required for fetching and decoding the instructions, which would increase the energy costs even more.

Table 2: 24-bit DSP vs. 8-bit NNE. Energy cost is estimated on a 45 nm process [4]

NNE

• Inference = processing of a one-second audio file

• Required MAC operations: 6,600/inference

• Total cycles: 7,295/inference

• Memory not accessed in only 45 cycles (0.62% of all cycles)

• Accuracy: 80.28% (4,890 audio files)


Group    Number of shifts/group    Additional shifts
Green    2                         ?
Blue     1                         ?
Orange   3                         ?

Group    Number of shifts/group    Additional shifts
Green    2                         3 - 2 = 1
Blue     1                         3 - 1 = 2
Orange   3                         0
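The "Additional shifts" column follows directly from the per-group counts: every group in a vector is brought up to the largest shift so all outputs share one scale. A minimal sketch (group names and shift counts are taken from the tables; the rule is inferred from the worked answers):

```python
# Step 3.1 sketch: within a vector, all groups must end up at a common
# scale, so each group is shifted further until it matches the largest
# per-group shift (3, from the Orange group).
groups = {"Green": 2, "Blue": 1, "Orange": 3}   # shifts/group

max_shift = max(groups.values())                 # common target scale
additional = {g: max_shift - s for g, s in groups.items()}

print(additional)  # → {'Green': 1, 'Blue': 2, 'Orange': 0}
```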

Benchmarking and improving NN execution on DSP vs. custom accelerator for hearing instruments

Zuzana Jelčicová*, Adrian Mardari*, Oskar Andersson*, Evangelia Kasapaki*, Jens Sparsø✺

Demant A/S, Smørum (Denmark)*; Technical University of Denmark, Lyngby (Denmark)✺

N = number of layers, excluding the input layer

O = number of output neurons in the current layer

A = number of inputs/activations to the layer
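Given these definitions, one plausible way to count the minimum read operations behind Table 1 can be sketched. The twelve-values-per-access packing follows from the 96-bit interface and 8-bit data, but the exact access pattern is the NNE's own, so treat this as an illustration rather than a reproduction of the table:

```python
import math

# Hypothetical read-count sketch for one inference with 8-bit parameters.
# layers[i] = (A, O): A inputs/activations and O output neurons per layer.
# Assumes twelve 8-bit values per 96-bit memory access; biases are ignored
# (negligible, as noted for Table 1).
VALUES_PER_ACCESS = 96 // 8  # = 12

def min_reads(layers):
    total = 0
    for A, O in layers:
        weight_reads = O * math.ceil(A / VALUES_PER_ACCESS)  # O rows of A weights
        input_reads = math.ceil(A / VALUES_PER_ACCESS)       # activations read once
        total += weight_reads + input_reads
    return total

# The KWS model from this poster: 250x144x144x144x12
print(min_reads([(250, 144), (144, 144), (144, 144), (144, 12)]))
```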

NNE includes three optimizations:

1. Reduced wordlength

• 24-bit to 8-bit parameters (inputs, weights, biases)

• On-the-fly symmetric quantization (each layer individually)

• Insignificant loss in accuracy

2. Several MACs in parallel

• Input & output stationary processing

• Considerations: 96-bit memory interface and 8-bit data
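A minimal sketch of symmetric per-layer quantization as in item 1. The scale-from-largest-magnitude rule and round-to-nearest here are common conventions, not details confirmed by the poster, so the NNE's exact scheme may differ:

```python
def quantize_symmetric(values, n_bits=8):
    """Symmetric per-layer quantization: one scale per layer,
    zero-point fixed at 0. A sketch only; the NNE's rounding mode
    and scale representation are assumptions."""
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax   # one scale per layer
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

acts = [0.5, -1.2, 0.03, 1.2]
q, scale = quantize_symmetric(acts)
# Dequantized values q[i] * scale approximate the originals.
```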

Table 1: Number of read operations (inputs and weights) needed for inference using 8-bit parameters.

Biases are excluded as they are negligible.

DSPs in general

• Datapaths with 16- and 24-bit elements

Neural network (NN) inference

• 8 bits are usually sufficient

o More frequent overflows → additional memory accesses, increased energy costs, and unpredictable timing behavior
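The overflow point can be illustrated with a toy accumulation: 8-bit products quickly exceed the 8-bit range, and each corrective rescale is extra work whose frequency depends on the data. The range check and right-shift policy below are a hypothetical sketch, not the xDSP's actual scaling logic:

```python
# Toy MAC accumulation for one neuron with 8-bit data.  Whenever the
# running sum leaves the signed 8-bit range, it is rescaled by an
# arithmetic right shift; the shift count varies with the inputs,
# which is what makes timing data-dependent.
acts = [100, -90, 120]   # 8-bit activations
wts  = [50, 60, -70]     # 8-bit weights

acc = 0
shifts = 0
for a, w in zip(acts, wts):
    acc += a * w
    while not (-128 <= acc <= 127):  # value no longer fits in 8 bits
        acc >>= 1                    # corrective right shift (rescale)
        shifts += 1

print(shifts, acc)
```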

xDSP vs NN processing engine (NNE)

xDSP → Oticon’s DSP-based platform used for obtaining baseline results

NNE achieves further power optimizations by exploiting three mutually dependent techniques:

1. Reduced wordlength – 24 to 8 bits with insignificant loss in accuracy

2. Several MACs in parallel – reduced wordlength → processing of more data at once

3. Two-step scaling

• eliminates the need to reload and scale already computed outputs to maintain the ratio

• makes our NNE always execute in a deterministic number of cycles

Introduction

• Feedforward fully connected DNN model (250x144x144x144x12) [1] trained on 12 categories (the first two represent silence and unknown words): "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"

• ReLU and softmax (output layer) activation functions

• Dataset: 65,000 one-second utterances of 30 words [2]

• Input to the DNN is the flattened feature matrix

• Input signal (length L) is framed into consecutive frames (length l = 40 ms) with a stride (s = 40 ms), giving 25 frames as input/inference

• 10 frequency bins → 250 inputs
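The framing arithmetic above checks out with the standard framing formula 1 + (L - l) / s (a general convention, not quoted from the poster):

```python
L_ms = 1000   # one-second utterance
l_ms = 40     # frame length
s_ms = 40     # stride (equal to the frame length, so frames abut)

n_frames = 1 + (L_ms - l_ms) // s_ms   # standard framing formula
n_bins = 10                            # frequency bins per frame
n_inputs = n_frames * n_bins           # flattened feature matrix

print(n_frames, n_inputs)  # → 25 250, matching the 250-input DNN layer
```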

Keyword Spotting Application (KWS)

Figure 1: KWS system consisting of a feature extractor and a NN based classifier

• Generic audio DSP → baseline for comparisons

• 24 bits for representing data (Q5.19)

• SIMD4 (96-bit memory interface)

Digital Signal Processor (xDSP)

[1] https://github.com/ARM-software/ML-KWS-for-MCU

[2] https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html

[3] Quantization algorithms. https://nervanasystems.github.io/distiller/algo_quantization.html

[4] Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (Feb 2014), pp. 10–14

References

Figure 2: System overview diagram for the NNE [3]