


Neural Network Engine (NNE)

• NN inference is dominated by memory accesses, not MAC operations

• 8-bit wordlength has little impact on model accuracy

Using the three optimization techniques in NNE reduced:

o Memory accesses by 5.6x compared to baseline xDSP

o Energy costs by 5.5x compared to baseline xDSP

o Memory required for parameters by 3x

• NNE always performs in a deterministic number of cycles for any arbitrary network

Conclusion

Results

Deterministic number of cycles:

3. Two-step scaling

• Solves inefficient individual scaling of activations

3.1 Within a vector (when writing results)

3.2 Across a layer (when reading results in the next layer)


*Minimum number of required accesses; this number can grow depending on how many times previously computed values must be retrieved and shifted additionally.

**This number does not include the energy required for fetching and decoding the instructions, which would increase the energy costs even more.

Table 2: 24-bit DSP vs. 8-bit NNE. Energy cost is estimated on a 45 nm process [4]

NNE

• Inference = processing of a one-second audio file

• Required MAC operations: 6,600/inference

• Total cycles: 7,295/inference

• Memory not accessed in only 45 cycles (0.62% of all cycles)

• Accuracy: 80.28% (4,890 audio files)


Group    Number of shifts/group    Additional shifts
Green    2                         ?
Blue     1                         ?
Orange   3                         ?

Group    Number of shifts/group    Additional shifts
Green    2                         3 - 2 = 1
Blue     1                         3 - 1 = 2
Orange   3                         0
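The "Additional shifts" column follows directly from the per-group counts: every group in a vector is brought up to the largest shift so all outputs share one scale. A minimal sketch (group names and shift counts are taken from the tables; the rule is inferred from the worked answers):

```python
# Step 3.1 sketch: within a vector, all groups must end up at a common
# scale, so each group is shifted further until it matches the largest
# per-group shift (3, from the Orange group).
groups = {"Green": 2, "Blue": 1, "Orange": 3}   # shifts/group

max_shift = max(groups.values())                 # common target scale
additional = {g: max_shift - s for g, s in groups.items()}

print(additional)  # → {'Green': 1, 'Blue': 2, 'Orange': 0}
```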

Benchmarking and improving NN execution on DSP vs. custom accelerator for hearing instruments

Zuzana Jelčicová*, Adrian Mardari*, Oskar Andersson*, Evangelia Kasapaki*, Jens Sparsø✺

Demant A/S, Smørum (Denmark)*; Technical University of Denmark, Lyngby (Denmark)✺

N = number of layers, excluding the input layer

O = number of output neurons in the current layer

A = number of inputs/activations to the layer
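Given these definitions, one plausible way to count the minimum read operations behind Table 1 can be sketched. The twelve-values-per-access packing follows from the 96-bit interface and 8-bit data, but the exact access pattern is the NNE's own, so treat this as an illustration rather than a reproduction of the table:

```python
import math

# Hypothetical read-count sketch for one inference with 8-bit parameters.
# layers[i] = (A, O): A inputs/activations and O output neurons per layer.
# Assumes twelve 8-bit values per 96-bit memory access; biases are ignored
# (negligible, as noted for Table 1).
VALUES_PER_ACCESS = 96 // 8  # = 12

def min_reads(layers):
    total = 0
    for A, O in layers:
        weight_reads = O * math.ceil(A / VALUES_PER_ACCESS)  # O rows of A weights
        input_reads = math.ceil(A / VALUES_PER_ACCESS)       # activations read once
        total += weight_reads + input_reads
    return total

# The KWS model from this poster: 250x144x144x144x12
print(min_reads([(250, 144), (144, 144), (144, 144), (144, 12)]))
```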

NNE includes three optimizations:

1. Reduced wordlength

• 24-bit to 8-bit parameters (inputs, weights, biases)

• On-the-fly symmetric quantization (each layer individually)

• Insignificant loss in accuracy

2. Several MACs in parallel

• Input & output stationary processing

• Considerations: 96-bit memory interface and 8-bit data
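A minimal sketch of symmetric per-layer quantization as in item 1. The scale-from-largest-magnitude rule and round-to-nearest here are common conventions, not details confirmed by the poster, so the NNE's exact scheme may differ:

```python
def quantize_symmetric(values, n_bits=8):
    """Symmetric per-layer quantization: one scale per layer,
    zero-point fixed at 0. A sketch only; the NNE's rounding mode
    and scale representation are assumptions."""
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax   # one scale per layer
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

acts = [0.5, -1.2, 0.03, 1.2]
q, scale = quantize_symmetric(acts)
# Dequantized values q[i] * scale approximate the originals.
```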

Table 1: Number of read operations (inputs and weights) needed for inference using 8-bit parameters.

Biases are excluded as they are negligible.

DSPs in general

• Datapaths with 16- and 24-bit elements

Neural network (NN) inference

• 8 bits are usually sufficient

o More frequent overflows → additional memory accesses, increased energy costs, and unpredictable timing behavior
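The overflow point can be illustrated with a toy accumulation: 8-bit products quickly exceed the 8-bit range, and each corrective rescale is extra work whose frequency depends on the data. The range check and right-shift policy below are a hypothetical sketch, not the xDSP's actual scaling logic:

```python
# Toy MAC accumulation for one neuron with 8-bit data.  Whenever the
# running sum leaves the signed 8-bit range, it is rescaled by an
# arithmetic right shift; the shift count varies with the inputs,
# which is what makes timing data-dependent.
acts = [100, -90, 120]   # 8-bit activations
wts  = [50, 60, -70]     # 8-bit weights

acc = 0
shifts = 0
for a, w in zip(acts, wts):
    acc += a * w
    while not (-128 <= acc <= 127):  # value no longer fits in 8 bits
        acc >>= 1                    # corrective right shift (rescale)
        shifts += 1

print(shifts, acc)
```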

xDSP vs NN processing engine (NNE)

xDSP → Oticon’s DSP-based platform used for obtaining baseline results

NNE achieves further power optimizations by exploiting three mutually dependent techniques:

1. Reduced wordlength – 24 to 8 bits with insignificant loss in accuracy

2. Several MACs in parallel – reduced wordlength → processing of more data at once

3. Two-step scaling

• eliminates the need to reload and scale already computed outputs to maintain the ratio

• makes our NNE always execute in a deterministic number of cycles

Introduction

• Feedforward fully connected DNN model (250x144x144x144x12) [1] trained on 12 categories (the first two represent silence and unknown words): "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"

• ReLU and softmax (output layer) activation functions

• Dataset: 65,000 one-second utterances of 30 words [2]

• Input to the DNN is the flattened feature matrix

• Input signal (length L) is framed into consecutive frames (length l = 40 ms) with a stride (s = 40 ms), giving 25 frames as input/inference

• 10 frequency bins → 250 inputs
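The framing arithmetic above checks out with the standard framing formula 1 + (L - l) / s (a general convention, not quoted from the poster):

```python
L_ms = 1000   # one-second utterance
l_ms = 40     # frame length
s_ms = 40     # stride (equal to the frame length, so frames abut)

n_frames = 1 + (L_ms - l_ms) // s_ms   # standard framing formula
n_bins = 10                            # frequency bins per frame
n_inputs = n_frames * n_bins           # flattened feature matrix

print(n_frames, n_inputs)  # → 25 250, matching the 250-input DNN layer
```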

Keyword Spotting Application (KWS)

Figure 1: KWS system consisting of a feature extractor and a NN based classifier

• Generic audio DSP → baseline for comparisons

• 24 bits for representing data (Q5.19)

• SIMD4 (96-bit memory interface)

Digital Signal Processor (xDSP)

[1] https://github.com/ARM-software/ML-KWS-for-MCU

[2] https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html

[3] Quantization algorithms. https://nervanasystems.github.io/distiller/algo_quantization.html

[4] Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (Feb 2014), pp. 10–14

References

Figure 2: System overview diagram for the NNE [3]